How to Extract Text from PDFs using Foxit’s REST APIs

Want to extract text from PDF files with just a few lines of Python? This guide shows how to use Foxit’s REST Extract API to pull text content from PDFs, ideal for search, automation, or AI workflows. From setting up credentials to searching for keywords across multiple files, this post walks through the full process with example code and GitHub demos.

PDFs are an excellent way to store information—they combine text, images, and more in a perfectly laid-out, eye-catching design that fulfills every marketer’s wildest dreams. But sometimes you just need the text! There’s a variety of reasons you may want to convert a rich PDF document into plain text:

  • For indexing in a search engine
  • To search documents for keywords
  • To pass to generative AI services for introspection

Let’s take a look at the Extract API to see just how easy this is.

Start Here: Obtain Free Credentials to Use the Foxit API

Before we go any further, head over to our developer portal and grab a set of free credentials. This will include a client ID and secret values – you’ll need both to make use of the API.

Foxit PDF API Workflow Overview with Python

The API follows the same format as the rest of our PDF Services: you upload your input, kick off the job, check the job’s status, and download the result. As we’ve covered this a few times now on the blog (see my introductory post), we’ll skip over the details of uploading the document and loading in credentials. Here’s the Python code we’ve demonstrated before showing this in action:

import os
import requests

CLIENT_ID = os.environ.get('CLIENT_ID')
CLIENT_SECRET = os.environ.get('CLIENT_SECRET')
HOST = os.environ.get('HOST')

def uploadDoc(path, id, secret):
	
	headers = {
		"client_id":id,
		"client_secret":secret
	}

	with open(path, 'rb') as f:
		files = {'file': (path, f)}

		request = requests.post(f"{HOST}/pdf-services/api/documents/upload", files=files, headers=headers)
		return request.json()

doc = uploadDoc("../../inputfiles/input.pdf", CLIENT_ID, CLIENT_SECRET)
print(f"Uploaded pdf to Foxit, id is {doc['documentId']}")

Now let's get into the meat of the Extract API. The API takes three arguments:

  • The ID of the previously uploaded document.
  • The type of information to extract—either TEXT, IMAGE, or PAGE. In theory, it should be pretty obvious what these do, but just in case: TEXT returns the text contents of the PDF. IMAGE gives you a ZIP file of images from the PDF. PAGE returns a new PDF containing just the page you requested.
  • You can also pass in a page range, which can be a combo of specific pages and ranges. If you don’t include one, the entire PDF gets processed for extraction.

To make this simple to use, I've built a wrapper function that lets you pass these arguments:

def extractPDF(doc, type, id, secret, pageRange=None):
    
    headers = {
        "client_id":id,
        "client_secret":secret,
        "Content-Type":"application/json"
    }

    body = {
        "documentId":doc,
        "extractType":type
    }

    if pageRange:
        body["pageRange"] = pageRange 

    request = requests.post(f"{HOST}/pdf-services/api/documents/modify/pdf-extract", json=body, headers=headers)
    return request.json()
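
To make the three arguments a bit more concrete, here is a minimal sketch of a helper that builds and sanity-checks the request body before it gets posted. The name `build_extract_body` is hypothetical (it is not part of the Foxit samples), and the page range string `"1,3-5"` is only an assumed format for "a combo of specific pages and ranges", so check the API docs for the exact syntax.

```python
# Hypothetical helper, not part of the Foxit samples: builds the JSON body
# for the extract call and rejects unknown extract types up front.
VALID_EXTRACT_TYPES = {"TEXT", "IMAGE", "PAGE"}

def build_extract_body(document_id, extract_type, page_range=None):
    # Accepting lowercase input is a convenience added in this sketch.
    extract_type = extract_type.upper()
    if extract_type not in VALID_EXTRACT_TYPES:
        raise ValueError(f"extractType must be one of {sorted(VALID_EXTRACT_TYPES)}")

    body = {"documentId": document_id, "extractType": extract_type}
    # Leaving pageRange out means the entire PDF is processed.
    if page_range:
        body["pageRange"] = page_range
    return body

print(build_extract_body("abc123", "text"))
# The "1,3-5" range syntax below is an assumption, not confirmed by the docs.
print(build_extract_body("abc123", "IMAGE", page_range="1,3-5"))
```

Validating locally like this is optional, but it turns a typo in the extract type into an immediate error instead of a failed API call.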

Literally, that's it. At this point, you get a task object back that – like with our other APIs – can be checked for completion, and once it’s done, the results can be downloaded. Since we're working with text, though, let's simplify and just grab the text as a variable:

def getResult(doc, id, secret):
    
    headers = {
        "client_id":id,
        "client_secret":secret
    }

    return requests.get(f"{HOST}/pdf-services/api/documents/{doc}/download", headers=headers).text

This utility method takes a document ID and returns the document’s text content. Here’s how the whole flow looks, with checkTask being the polling helper from the introductory post:

doc = uploadDoc("../../inputfiles/input.pdf", CLIENT_ID, CLIENT_SECRET)
print(f"Uploaded pdf to Foxit, id is {doc['documentId']}")

task = extractPDF(doc["documentId"], "TEXT", CLIENT_ID, CLIENT_SECRET)
print(f"Created task, id is {task['taskId']}")

result = checkTask(task["taskId"], CLIENT_ID, CLIENT_SECRET)
print(f"Final result: {result}")

text = getResult(result["resultDocumentId"], CLIENT_ID, CLIENT_SECRET)
print(text)

You can see the entire script on our GitHub. Running it will just give you a wall of text. Not terribly exciting. So, let’s make it exciting!

Searching PDFs for Keywords

Let’s iterate on the previous example to build something a bit more useful: given a set of input PDFs, extract the text from each and report whether a given keyword is found. I’ll start by gathering a list of PDFs from a source directory, but you could imagine them coming from new files in a cloud storage provider, attachments in new emails, and so forth:

# Get PDFs from our input directory
inputFiles = list(filter(lambda x: x.endswith('.pdf'), os.listdir('../../inputfiles')))
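
One small caveat with the filter above: it only matches a lowercase `.pdf` extension and would also pick up a folder that happens to end in `.pdf`. Here is a minimal alternative sketch using `pathlib`; the function name `list_pdfs` is my own and not part of the original script.

```python
from pathlib import Path

def list_pdfs(directory):
    """Return the names of PDF files in a directory, sorted.

    The extension check ignores case, so "Report.PDF" is included,
    and sub-directories are skipped.
    """
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Swapping it in would look like `inputFiles = list_pdfs('../../inputfiles')`.
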

Now, I’ll define a keyword. A more complex version of this would probably use a list of keywords, but we’ll keep it simple for now:

# Keyword to match on: 
keyword = "Shakespeare"
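
If you do want the list-of-keywords version, here is a minimal sketch of what it could look like. The helper `find_keywords` is hypothetical, and it uses plain case-insensitive substring matching, so a keyword like "art" would also match "party".

```python
def find_keywords(text, keywords):
    """Return the keywords that appear in text, ignoring case."""
    haystack = text.casefold()
    return [kw for kw in keywords if kw.casefold() in haystack]

sample = "All the world's a stage, wrote Shakespeare."
print(find_keywords(sample, ["shakespeare", "Marlowe", "stage"]))
# prints ['shakespeare', 'stage']
```

An empty list means no match, so `if find_keywords(text, keywords):` can replace the single-keyword `in` check.
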

And now to actually do the work. Remember, we’ve already defined our methods, so the only thing changing here is the code calling them:

for file in inputFiles:
    
    doc = uploadDoc(f"../../inputfiles/{file}", CLIENT_ID, CLIENT_SECRET)
    print(f"Uploaded pdf, {file}, to Foxit, id is {doc['documentId']}")

    task = extractPDF(doc["documentId"], "TEXT", CLIENT_ID, CLIENT_SECRET)
    result = checkTask(task["taskId"], CLIENT_ID, CLIENT_SECRET)

    text = getResult(result["resultDocumentId"], CLIENT_ID, CLIENT_SECRET)
    if keyword in text:
        print(f"\033[32mThe pdf, {file}, matched on our keyword: {keyword}\033[0m")
    else:
        print(f"The pdf, {file}, did not match on our keyword: {keyword}")
    
    print("")

Given my set of inputs, there’s only one match. Here’s the output I received:

[Image: output from the script showing which documents contained the keyword]

You can find the complete source code for this on our GitHub repo.

What’s Next?

The demo here is fairly simple, but you could imagine it being expanded to include things like automatic routing of PDFs with matching keywords, email alerts, and so forth. As a reminder, when working with any process like this, you can cache the result of the extraction. Imagine a scenario where the important keywords may change in the future. Your code could store the result of the text extract to the file system (perhaps with the same name as the PDF but using `.txt` as the extension instead) and simply skip calling our API when the cache exists. Our API will miss you, but that’s ok.
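
Here is a minimal sketch of that caching idea. The function `cached_extract` and its `extract_text` parameter are hypothetical names: `extract_text` stands for any callable that takes a PDF path and returns its text, for example a function wrapping the upload, extract, check, and download steps shown earlier.

```python
from pathlib import Path

def cached_extract(pdf_path, extract_text):
    """Return the text for pdf_path, calling extract_text only on a cache miss.

    The cache is a .txt file next to the PDF with the same base name.
    """
    cache_path = Path(pdf_path).with_suffix(".txt")
    if cache_path.exists():
        # Cache hit: no API call needed.
        return cache_path.read_text(encoding="utf-8")

    text = extract_text(pdf_path)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

Note that this sketch never invalidates the cache, so if a PDF is replaced with a new version under the same name, you would need to delete the matching `.txt` file (or compare modification times) to pick up the change.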

If this all sounds exciting, be sure to check the docs for more information about the Extract API. Sign up for some free developer credentials and reach out on our developer forums with any questions.

Introducing PDF APIs from Foxit

Get started with Foxit’s new PDF APIs—convert Word to PDF, generate documents, and embed files using simple, scalable REST APIs. Includes sample Python code and walkthrough.

At the end of June, Foxit introduced a brand-new suite of tools to help developers work with documents. These APIs cover a wide range of features, including:

    • Convert between Office document formats and PDF files seamlessly
    • Optimize, manipulate, and secure PDFs with advanced APIs
    • Generate dynamic documents using Microsoft Word templates
    • Extract text and images from PDFs with powerful tools
    • Embed PDFs into web pages in a context-aware, controlled manner
    • Integrate with eSign APIs for streamlined signature workflows


These APIs are simple to use, and best of all, follow the “don’t surprise me” principle of development. In this post, I’m going to demonstrate one simple example, converting a Word document to PDF, but you can rest assured that nearly all the APIs will follow incredibly similar patterns. I’ll be using Python for my examples here, but will link to a Node.js version of the same example. And given that we’re talking REST APIs here, any language is welcome to join the document party. Let’s dive in.

Credentials

Before we go any further, head over to our developer portal and grab a set of free credentials. This will include a client ID and secret values you’ll need to make use of the API.

API Flow

As I mentioned above, most of the PDF Services APIs will follow a similar flow. This comes down to:

  • Upload your input (like a Word document)
  • Kick off a job (like converting to PDF)
  • Check the job (hey, how ya doin?)
  • Download the result

The great thing is, once you’ve completed one integration (this post focuses on converting Word to PDF), switching to another is easy, and much of your existing code can be reused. A lazy developer is a happy developer! Let’s get started.

Loading Credentials

My script begins by loading the credentials and API root host via the environment:

import os
import sys
from time import sleep

import requests

CLIENT_ID = os.environ.get('CLIENT_ID')
CLIENT_SECRET = os.environ.get('CLIENT_SECRET')
HOST = os.environ.get('HOST')

It’s never a good idea to hard-code credentials in your code. But if you do it this one time, I won’t tell. Honest.
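
One thing to watch with `os.environ.get` is that a missing variable quietly becomes `None`, and the problem only shows up later as a confusing request error. Here is a minimal sketch of a loader that fails early instead; `load_settings` is a hypothetical helper of my own, not part of the sample scripts.

```python
import os

def load_settings(env=os.environ):
    """Read CLIENT_ID, CLIENT_SECRET and HOST, failing early if any is missing."""
    names = ("CLIENT_ID", "CLIENT_SECRET", "HOST")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return tuple(env[name] for name in names)
```

In the script, this would replace the three assignments with `CLIENT_ID, CLIENT_SECRET, HOST = load_settings()`.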

Uploading Your Input

As I mentioned, in this example we’ll be making use of the Word to PDF API. Our input will be a Word document, which we’ll upload to Foxit using the upload API. This endpoint is fairly simple – aside from your credentials, all you need to provide is the binary data of the input file. Here’s the method I created to make this process easier:

def uploadDoc(path, id, secret):
    
    headers = {
        "client_id":id,
        "client_secret":secret
    }

    with open(path, 'rb') as f:
        files = {'file': (path, f)}

        request = requests.post(f"{HOST}/pdf-services/api/documents/upload", files=files, headers=headers)
        return request.json()

And here’s how it’s used:

doc = uploadDoc("../../inputfiles/input.docx", CLIENT_ID, CLIENT_SECRET)
print(f"Uploaded doc to Foxit, id is {doc['documentId']}")

The upload API only returns one value, a documentId, which we can use in future calls.

Starting the Job

Each API operation is a job creator. By this I mean you call the endpoint and it begins your action. For Word to PDF, the only required input is the document ID from the previous call. We can build a nice little wrapper function like so:

def convertToPDF(doc, id, secret):
    
    headers = {
        "client_id":id,
        "client_secret":secret,
        "Content-Type":"application/json"
    }

    body = {
        "documentId":doc	
    }

    request = requests.post(f"{HOST}/pdf-services/api/documents/create/pdf-from-word", json=body, headers=headers)
    return request.json()

And then call it like so:

task = convertToPDF(doc["documentId"], CLIENT_ID, CLIENT_SECRET)
print(f"Created task, id is {task['taskId']}")

The result of this call, if no errors were found, is a taskId. We can use this to gauge how the job’s performing. Let’s do that now.

Job Checking

Ok, so the next part can be a bit tricky depending on your language of choice. We need to use the task status endpoint to determine how the job is performing. How often we do this, how quickly and so forth, will depend on your platform and needs. For our little sample script here, everything is running at once. I wrote a function that will check the status. If the job isn’t finished (whether successful or not), it pauses briefly before trying again. While this approach isn’t the most sophisticated, it should work well enough for basic testing:

def checkTask(task, id, secret):

    headers = {
        "client_id":id,
        "client_secret":secret,
        "Content-Type":"application/json"
    }

    done = False
    while done is False:

        request = requests.get(f"{HOST}/pdf-services/api/tasks/{task}", headers=headers)
        status = request.json()
        if status["status"] == "COMPLETED":
            done = True
            # really only need resultDocumentId, will address later
            return status
        elif status["status"] == "FAILED":
            print("Failure. Here is the last status:")
            print(status)
            sys.exit()
        else:
            print(f"Current status, {status['status']}, percentage: {status['progress']}")
            sleep(5)

As you can see, I’m using a while loop that—at least in theory—will continue running until a success or failure response is returned, with a five-second pause between each call. You can adjust that interval as needed—test different values to see what works best for your use case. Typically, most API calls should complete in under ten seconds, so a five-second delay felt like a reasonable default.
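
If "in theory" makes you nervous, here is a minimal sketch of the same loop with an overall timeout, raising exceptions rather than exiting the process. Everything here is an assumption layered on top of the original: `wait_for_task` is a hypothetical name, `fetch_status` is any callable that returns the task status dictionary (for example, a small function wrapping the GET request above), and the 120-second default is arbitrary.

```python
import time

def wait_for_task(fetch_status, interval=5, timeout=120,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the task completes, fails, or time runs out.

    sleep and clock are injectable so the loop can be exercised without
    real waiting.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["status"] == "COMPLETED":
            return status
        if status["status"] == "FAILED":
            raise RuntimeError(f"Task failed: {status}")
        if clock() >= deadline:
            raise TimeoutError(f"Task still {status['status']} after {timeout} seconds")
        sleep(interval)
```

Raising instead of calling `sys.exit()` lets the caller decide what to do, which matters once you are processing several documents in one run and don't want a single failure to stop the whole batch.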

Each call to the endpoint returns a task status result. Here’s an example:

{
    'taskId': '685abc95a0d113558e4204d7', 
    'status': 'COMPLETED', 
    'progress': 100, 
    'resultDocumentId': '685abc952475582770d6917b'
}

The important part here is the status. But you could also use progress to give some feedback to the code waiting for results. Here’s my code calling this:

result = checkTask(task["taskId"], CLIENT_ID, CLIENT_SECRET)
print(f"Final result: {result}")

Downloading Your Result

The last piece of the puzzle is simply saving the result. If you noticed above, the task returned a resultDocumentId value. Taking that, and the Download Document endpoint, we can build a utility to store the result like so:

def downloadResult(doc, path, id, secret):
    
    headers = {
        "client_id":id,
        "client_secret":secret
    }

    with open(path, "wb") as output:
        
        bits = requests.get(f"{HOST}/pdf-services/api/documents/{doc}/download", stream=True, headers=headers).content 
        output.write(bits)

And finally, call it:

downloadResult(result["resultDocumentId"], "../../output/input.pdf", CLIENT_ID, CLIENT_SECRET)
print("Done and saved to: ../../output/input.pdf")

And that’s it! While this script could certainly benefit from more robust error handling, it demonstrates the basic flow. As mentioned, most of our APIs follow this same logic.
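
Since the flow is the same across APIs, one way to reuse it is to wrap the four steps in a single function and pass the step functions in. This is only a sketch: `run_job` and its parameter names are hypothetical, and each argument is expected to be a callable with credentials already bound.

```python
def run_job(input_path, output_path, upload, start, wait, download):
    """Run the upload -> start -> wait -> download flow with caller-supplied steps.

    upload(path)         -> dict containing "documentId"
    start(document_id)   -> dict containing "taskId"
    wait(task_id)        -> dict containing "resultDocumentId"
    download(doc_id, path) saves the result to path
    """
    doc = upload(input_path)
    task = start(doc["documentId"])
    result = wait(task["taskId"])
    download(result["resultDocumentId"], output_path)
    return result
```

With the functions from this post, the steps could be bound with lambdas, for example `upload=lambda p: uploadDoc(p, CLIENT_ID, CLIENT_SECRET)` and `start=lambda d: convertToPDF(d, CLIENT_ID, CLIENT_SECRET)`. Switching to a different API then means swapping only the `start` step.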

Next Steps

Want the complete script? Get it on GitHub.

Want it in Node.js? Get it on GitHub.

Rather try this yourself? Sign up for a free developer account now. Need help? Head over to our developer forums and post your questions and comments.