
Using Tabula to Parse PDF Tables

Recently, I wrote about trying to figure out a way to automatically produce visualizations of STM32 peripheral mappings from their datasheets. Unfortunately, it didn’t go too well; I had trouble parsing the output from the command-line pdftotext utility, so I ended up having to manually clean up each peripheral table after it was half-parsed by a script.

I was thinking of trying to write a better parsing script, but before diving into that rabbit hole I took another look at open-source PDF-parsing programs and found Tabula. It is an MIT-licensed utility with one goal: extracting tables from PDF files. And it seems to work very well with ST’s datasheets.

Tabula interface example

Tabula has a nice local web interface which gives you previews of the table data that it will export from a PDF.

Step 1: Download Tabula

Tabula runs on Java, so it’s simple to set up on just about any platform. They have good installation instructions on their website and GitHub readme file. If you are using Linux, you can download and unzip the tabula-jar-<version>.zip version of the latest release to get a runnable JAR file.

Once you unzip the file and cd into its tabula/ directory, you can start a local server with the command suggested in the README.md file:

java -Dfile.encoding=utf-8 -Xms256M -Xmx1024M -jar tabula.jar

Once it starts up, you should be able to navigate to http://127.0.0.1:8080 in a browser to access the Tabula UI. There is also a command-line version of the project, which will probably be better for setting up an automatic process. But for now, it’s nice to use the visual selection tools which the UI provides.

Step 2: Parse a Datasheet

The Tabula UI makes it pretty easy to import files, but large PDFs can be slow to import and cumbersome to navigate. The process is easier if you only upload the pages of the datasheet which contain the relevant tables. There are dozens of ways to extract a few pages from a PDF file, but I used a tool called qpdf. Here’s an example command to extract pages 39-44 from a file:

qpdf --pages input.pdf 39-44 -- input.pdf output.pdf

Most PDF viewers will probably be able to save page ranges, too. Once you’ve extracted the pages of interest, you can upload the PDF to your local Tabula server like you would upload a file to a normal website:

Tabula index page

Tabula’s index page

Once you’ve uploaded your file, click ‘Import’ and wait for the process to finish. Once it does, you should see a preview of the PDF file:

Tabula preview

Tabula’s preview interface

You can click and drag to select rectangular areas for the program to process, so try selecting the entire table on the first page. I decided to omit the top rows, since they are repeated on each page:

Tabula's "select" interface.

Click and drag to select areas of tables which you want to transcribe.

Once you’ve selected the area of interest, you can click the ‘Repeat this Selection’ button (circled in purple) to select the same area on every other page. In this case, it’s a nice timesaver. But be sure to double-check each page to make sure the boundaries line up, especially the last one:

Tabula's "select" interface with footnotes excluded.

You can drag the edges of a selection to change it if it covers part of the document you don’t want.

Once you’re happy that your selections cover all of the table cells, click the ‘Preview & Export Extracted Data’ button.

Tabula's malformed output

Tabula’s default algorithm does the same thing as `pdftotext`.

Whoops, that doesn’t look quite right. In fact, it looks a lot like the output produced by pdftotext – most of the peripherals end up on their own rows. This would still be easier to parse than the output of my Python scripts – you can find row boundaries by looking for lines that lack a trailing `,` or `/` character – but there’s an even better option. Click the ‘Lattice’ button (circled in purple above), and Tabula will use the table’s ruled lines instead of the text layout to infer cell boundaries.
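For the record, that fallback heuristic only takes a few lines of Python. This is a hypothetical sketch, not a script from the repository: any row still ending in `,` or `/` is assumed to continue onto the next line.

```python
def merge_rows(lines):
    """Re-join table rows that stream-style extraction split across lines.

    A row is assumed to continue as long as it ends in a ',' or '/'
    continuation character; the next line is glued onto it."""
    rows = []
    for line in lines:
        if rows and rows[-1].endswith((",", "/")):
            rows[-1] += line  # still mid-row: append this fragment
        else:
            rows.append(line)  # previous row is complete; start a new one
    return rows

# Hypothetical fragment of stream-style output:
print(merge_rows(["PA0,", "ADC1_IN0/", "TIM2_CH1", "PA1,", "ADC1_IN1"]))
# → ['PA0,ADC1_IN0/TIM2_CH1', 'PA1,ADC1_IN1']
```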

Tabula's "Lattice" algorithm can sometimes work when the default doesn't.

The ‘Lattice’ option seems to work great

Now you can just click the ‘Export’ button and you’ll get a CSV file. I added a ‘CSV to LaTeX’ script to the last post’s GitHub repository, with the same syntax as last time:

python3 pin_tabula_ingest.py <file>.csv <pin_column>

And like with the other script, the pin_column option selects which of the datasheet’s packages you want to generate a table for; it’s a zero-based column index counting up from the left-most column.
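As a rough illustration of what that column selection looks like – a simplified sketch, not the actual pin_tabula_ingest.py – suppose the CSV’s first few columns are per-package pin numbers, followed by the shared name/function columns:

```python
import csv
import io

def select_package(csv_text, pin_column, n_packages):
    """Keep one package's pin-number column plus the shared columns.

    Assumes the first n_packages columns of each row hold per-package pin
    numbers, with pin_column counting up from 0 at the left-most of them."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        pin = row[pin_column]
        if pin:  # an empty cell means this package doesn't bond out the pin
            rows.append([pin] + row[n_packages:])
    return rows

# Hypothetical CSV: column 0 = LQFP pin, column 1 = BGA ball, then name/function
demo = "1,A1,PA0,ADC_IN0\n,B2,PB5,SPI1_MOSI\n"
print(select_package(demo, 1, 2))
# → [['A1', 'PA0', 'ADC_IN0'], ['B2', 'PB5', 'SPI1_MOSI']]
```

Selecting column 0 instead would drop the second row, since that pin has no LQFP entry in this made-up example.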

Conclusions

After running the Python script and converting the LaTeX files with pdflatex <file>.tex, you get similar PDF outputs to the last post which can be imported into Inkscape, Photoshop, etc.:

Example tables generated from Tabula's CSV output.

Tables generated from Tabula’s CSV files.

But using Tabula’s “Lattice” formatting option eliminates most of the tedious and error-prone manual steps from the previous post. Talk about a relief – I didn’t really like the thought of editing more than a few of those garbled table outputs by hand. And now that the table-generation step is nearly automatic, it’s possible to make images for larger chips without getting bored to tears:

STM32L475VG peripheral map

STM32L475VG pin:peripheral visualization
