Tika chm extractor – LGPL alternative

I’m pleased to announce that the LGPL-licensed Tika chm extractor was released yesterday. Strictly speaking, it is not purely LGPL: only the libraries it depends on are LGPL-licensed, while the rest of the code is under the Apache License, Version 2.0.

All relevant information can be found here.
To download the sources, go to GitHub.

Why should it live?
Well, the “original” Tika extraction algorithm works pretty well in most cases but has difficulties in rare ones. For reasons unknown, the inventors of the compressed HTML (chm) format never published its specification, so the algorithm that Tika’s chm parser uses to extract content is not perfect, though it is quite good.
The solution that crosses everybody’s mind is to use native libraries. Fair enough. The only question is how to make that work on multiple platforms. Aha! Having checked the available options, I found a stable Java library called sevenzipjbind (7-Zip-JBinding).

The extractor is designed as a standalone program, i.e. a server based on Jetty that listens for HTTP requests. It currently has three options: i. extract a single file, including its metadata; ii. extract content and metadata from all files in a provided directory; iii. extract only the metadata from a single chm.
In addition, it saves the extracted content and its metadata in a special folder following the pattern ../extracted_files/folder_name_as_file_name/extracted html files. Metadata goes under ../extracted_files/file_name.json
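
To make the output layout concrete, here is a minimal sketch of how the two target paths could be derived from a chm file name. It is not taken from the extractor's sources: the class `ExtractionLayout` and its methods are hypothetical names, and stripping the `.chm` extension to build the folder and JSON names is my assumption about what "file name" means in the pattern.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper, not part of the extractor: it only illustrates
// the folder pattern described in the post.
class ExtractionLayout {

    // Assumption: the folder and JSON names use the chm file name without its extension.
    static String baseName(String chmFileName) {
        String name = Paths.get(chmFileName).getFileName().toString();
        boolean isChm = name.toLowerCase(Locale.ROOT).endsWith(".chm");
        return isChm ? name.substring(0, name.length() - ".chm".length()) : name;
    }

    // <root>/<file_name>/ holds the extracted html files.
    static Path contentDir(Path root, String chmFileName) {
        return root.resolve(baseName(chmFileName));
    }

    // <root>/<file_name>.json holds the metadata.
    static Path metadataFile(Path root, String chmFileName) {
        return root.resolve(baseName(chmFileName) + ".json");
    }

    public static void main(String[] args) {
        Path root = Paths.get("..", "extracted_files");
        System.out.println(contentDir(root, "guide.chm"));
        System.out.println(metadataFile(root, "guide.chm"));
    }
}
```

Under these assumptions a file called guide.chm would end up with its html files in a guide folder and its metadata in guide.json, both under ../extracted_files.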

Examples of how to use it can also be found on GitHub.

Please don’t hesitate to ask questions, either by replying to this post, contacting me directly, or sending me a tweet!
