Home > announcement > Tika chm extractor – LGPL alternative

Tika chm extractor – LGPL alternative

Tika chm extractor

I’m pleased to announce that tika chm extractor LGPL licensed is released yesterday. Honestly, it’s not pure LGPL, only libraries it depends on, the rest of the code – Apache license version 2.0.

All relevant information can be found here.
Download the sources go to the github.

Why should it live?
Well, the “original” Tika’s extraction algorithm works pretty well in most of the cases, however, has “difficulties” in rare cases. Inventors of compressed html files by unknown reason couldn’t publish their specification thus the algorithm for extracting context from Tika chm parser is not perfect, but quite good.
Possible solution that crossed everybody’s mind, to use native libraries. Fare enough though. The only one question is in, how to make it working on multiple platforms. Aha! Having checked available options I figured out stable Java library called sevenzipjbind.

The extractor designed as stand alone program. I.e. is a server based on Jetty which listens to HTTP requests. Currently has three options: i. Extracts single file including metadata; ii. Extracts context & metadata from all files in the provided directory; iii. Extracts only metadata from single chm.
In addition, it saves extracted context & its metadata in special folder following the pattern : ../extracted_files/folder_name_as_file_name/extracted html files. Metadata goes under ../extracted_files/file_name.json

Examples how to use it you also can be found on github.

Please don’t hesitate to ask either by replying to this post, contacting me, or by sending a Twitter!

Categories: announcement Tags: , , ,
  1. No comments yet.
  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: