misteroleg

How Spring Integration can make your life easier.

June 23, 2013 misteroleg

Some time ago I started a new job at a big corporation. My first task was to port their C# TCP client to Java. The existing converters were poor, so I did it by hand. After a week or so, a fresh Java TCP client and a server simulator were written and waiting for further use. When we went over the client’s requirements, we found that the Java implementation lacked important features such as fail-over and auto-reconnection. Adding that functionality would have required us to write untested code and might have introduced flaws in the business logic. One of our guys said: aha, what if we replace our own Java implementation with another one, for instance Spring Integration? The rest of us smiled, thinking “what the heck?”. Anyway, my boss is a good champ who tries to use the best technologies available. We got a green light to do the research and learn something exciting. To simplify our requirements, I am going to show a simulator (aka server) and a client.

Before delving deeper, let me explain what Spring Integration is intended for. As its site puts it: “it provides an extension of the Spring programming model to support the well-known Enterprise Integration Patterns”. In other words, to design a good enterprise application one can use messaging, more precisely asynchronous messaging, which lets diverse applications be integrated with each other without nightmares or pain. The patterns come from the famous book “Enterprise Integration Patterns” by Gregor Hohpe and Bobby Woolf, published in the Martin Fowler Signature Series. The folks at Spring probably decided one day to put the theory into practice. A very pragmatic approach, isn’t it? Later you will see how well it fits regular tasks. The main concepts of SI are: Endpoint, Channel and Message.

An endpoint is a component that actually does something with a message. A message is a container consisting of a header and a payload. The header contains data that is relevant to the messaging system, whereas the payload contains the actual data. A channel connects two or more endpoints; it is similar to a Unix pipe. Two endpoints can exchange messages if and only if they are connected through a channel. Pretty easy, isn’t it? The following diagram shows this.
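To make the three concepts a bit more tangible, here is a minimal sketch in plain Java. It is only a toy model and not the Spring Integration API: the names MessagingSketch, Message, Channel, Endpoint and demo are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java toy model of the three concepts; NOT the Spring Integration API. */
class MessagingSketch {

    /** A message: headers for the messaging system, payload for the application. */
    static final class Message {
        final Map<String, Object> headers;
        final String payload;

        Message(Map<String, Object> headers, String payload) {
            this.headers = headers;
            this.payload = payload;
        }
    }

    /** An endpoint does something with a message. */
    interface Endpoint {
        void handle(Message message);
    }

    /** A channel connects endpoints: every sent message is delivered to each subscriber. */
    static final class Channel {
        private final List<Endpoint> subscribers = new ArrayList<Endpoint>();

        void subscribe(Endpoint endpoint) {
            subscribers.add(endpoint);
        }

        void send(Message message) {
            for (Endpoint endpoint : subscribers) {
                endpoint.handle(message);
            }
        }
    }

    /** Sends one payload through a channel and returns what the receiving endpoint saw. */
    static List<String> demo(String payload) {
        final List<String> received = new ArrayList<String>();
        Channel channel = new Channel();
        channel.subscribe(new Endpoint() {
            public void handle(Message message) {
                received.add(message.headers.get("id") + ":" + message.payload);
            }
        });

        Map<String, Object> headers = new HashMap<String, Object>();
        headers.put("id", 1);
        channel.send(new Message(headers, payload));
        return received;
    }

    public static void main(String[] args) {
        System.out.println(demo("hello"));
    }
}
```

The sending side never calls the receiving endpoint directly; it only knows the channel. That decoupling is the point of the model.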

The next step of our crash course is defining the requirements. We need a TCP server and a TCP client. We will write a simple application in which the two exchange a couple of messages with each other.

An important part of using SI is the configuration file, which contains all the components that we are going to use. Here is the “server” part of the configuration. To simplify the model and the SI lifecycle: Spring creates the objects that are defined in the configuration XML. More generally, this concept is called declarative programming. You define a business object in the XML, and the framework instantiates the appropriate classes for you, then injects and initializes their dependencies. The mantra says: you should concentrate only on the business logic and not on the plumbing.

Let’s define a part of the configuration xml, the server part.

http://pastebin.com/6AHQWPse

<int-ip:tcp-connection-factory id="tcpServerFactory"
    type="server"
    port="23234"
    single-use="false"
    serializer="byteArrayLenSerializer"
    deserializer="byteArrayLenSerializer" />

<int-ip:tcp-inbound-channel-adapter channel="serverIn"
    connection-factory="tcpServerFactory" />

<int-ip:tcp-outbound-channel-adapter channel="serverOut"
    connection-factory="tcpServerFactory" />

The important thing here is the factory (tcp-connection-factory): it creates a TCP server that uses a byte array length serializer. A serializer is needed to “package”, or encode, our message so that it can be transmitted over the wire. The deserializer, on the other hand, is needed to “unpackage”, or decode, it. Spring Integration has two kinds of factory, one for the client and another for the server; they differ by the type attribute [server or client]. The port is the one the server listens on for incoming messages. No IP address is mentioned here because the server runs on localhost.
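The idea behind a length-based serializer can be sketched in plain Java. This is only an illustration and not the Spring Integration class itself: the definition of the byteArrayLenSerializer bean is not shown here, so the 4-byte big-endian length header and the names LengthHeaderFraming, serialize, deserialize and roundTrip are my assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

/** Length-header framing sketch; the 4-byte header size is an assumption. */
class LengthHeaderFraming {

    /** "Packages" a message: a 4-byte big-endian length followed by the payload bytes. */
    static void serialize(byte[] payload, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(payload.length);
        data.write(payload);
        data.flush();
    }

    /** "Unpackages" a message: reads the length header, then exactly that many bytes. */
    static byte[] deserialize(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int length = data.readInt();
        byte[] payload = new byte[length];
        data.readFully(payload);
        return payload;
    }

    /** Frames two messages back to back and reads them off the same byte stream. */
    static String[] roundTrip(String first, String second) {
        try {
            ByteArrayOutputStream wire = new ByteArrayOutputStream();
            serialize(first.getBytes("UTF-8"), wire);
            serialize(second.getBytes("UTF-8"), wire);

            InputStream in = new ByteArrayInputStream(wire.toByteArray());
            String one = new String(deserialize(in), "UTF-8");
            String two = new String(deserialize(in), "UTF-8");
            return new String[] { one, two };
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

TCP is a byte stream without message boundaries, so the receiver needs some way to know where one message ends and the next begins. The length header is one such way, which is why both sides must agree on the same serializer and deserializer.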

We also defined two channels: serverIn (for incoming messages) and serverOut (for outgoing messages). So that our server can send and receive messages, we define an inbound and an outbound adapter, each associated with the factory and a channel. In our case they are the endpoints. When a message arrives, something has to take care of it. That responsibility belongs to a service, namely the file sender service. If it accepts the message, it then sends a file to the client in the background, line by line. Basically, when the server starts it listens for incoming messages, but only one specific message is accepted; when that message is received, the server sends the file line by line. If an error occurs, it is routed to the error channel. This is done using an interceptor.

I would like to say a couple of words about the SI lifecycle. The Spring framework has two “main” packages, org.springframework.beans and org.springframework.context, which form the core of the dependency injection container. The org.springframework.beans.factory.BeanFactory interface provides the basic container: it instantiates, configures and wires beans and runs their initialization and destruction callbacks (the start and stop lifecycle methods themselves are declared by the separate org.springframework.context.Lifecycle interface). The org.springframework.context.ApplicationContext builds on top of it and offers AOP integration, message resource handling and more.

Our server is ready, I mean, completely ready. To run the example follow the below steps:

cd /tcpserver

mvn clean install
mvn dependency:copy-dependencies
mvn exec:java -Dexec.mainClass="org.example.tcpserver.ServerRunner" -Dexec.args="--file=/file_to_be_sent.txt"

Our main class looks as follows:

CommandLinePropertySource clps = processProperties(args);

/* Spring Integration context used to get desirable beans. */
AbstractApplicationContext context =
        new ClassPathXmlApplicationContext(new String[] {"server-config.xml"}, false);
context.getEnvironment().getPropertySources().addFirst(clps);
context.refresh();
context.registerShutdownHook();

The source code can be found here http://pastebin.com/6PMpWTfX.
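The processProperties(args) helper is not shown above. As a rough idea of what turning options of the form --name=value into properties involves, here is a plain-Java sketch. It is a hypothetical stand-in and not the code from the example project; the names OptionArgs and parse are made up.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for command line option parsing; not the project's code. */
class OptionArgs {

    /** Collects arguments of the form --name=value (or --name) into a map. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<String, String>();
        for (String arg : args) {
            if (!arg.startsWith("--")) {
                continue; // not an option argument, ignored in this sketch
            }
            String body = arg.substring(2);
            int eq = body.indexOf('=');
            if (eq < 0) {
                options.put(body, ""); // flag without a value
            } else {
                options.put(body.substring(0, eq), body.substring(eq + 1));
            }
        }
        return options;
    }
}
```

With such a map in hand, the path of the file to send would simply be looked up under the key file.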

Also we define a file send service:

String key = new String(appropriateData, "UTF-8");
LOG.info("got.message" + " [" + key + "]");

/* If message accepted */
if (key.contains(SEARCH_KEY)) {
    LogReader lr = new LogReader(sender, msg);
    lr.setPath2File(getFile().getAbsolutePath());
    es.execute(lr);
}

http://pastebin.com/icHRdQS3
Next, define the business runner:

/* Creates an input stream to be read. */
fstream = new FileInputStream(getPath2File());

/* Wraps the input stream so that whole lines can be read. */
DataInputStream in = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(in));
while ((line = br.readLine()) != null) {
    command = line;
    sendAndLog(timeToWait);
}

http://pastebin.com/LZRdZ3Tg
Finally, for the server write an error handler which logs the errors:

public void handleRequestMessage(byte[] payload) {
    LOG.debug("Server got an error " + new String(payload));
}

http://pastebin.com/2EQvbVR8

That’s it, we are done with our server :-).

Now, let’s define a TCP client which will connect to the server, send an accept message and receive the file sent by the server.

Our configuration file looks as follows:

http://pastebin.com/egquzq5q

<int:gateway id="client"
    service-interface="org.example.tcpclient.TcpClientService"
    default-reply-channel="replyChannel"
    default-request-channel="requestChannel"
    default-reply-timeout="1000"
    default-request-timeout="1000">
</int:gateway>

<int:channel id="requestChannel">
    <int:queue capacity="10" />
</int:channel>

<int:channel id="replyChannel" />

Here is how to run the client:

Open a new terminal
cd /tcpclient
mvn clean install
mvn dependency:copy-dependencies
mvn exec:java -Dexec.mainClass="org.example.tcpclient.ClientTcp"

Almost the same logic applies here. Have a look.

A main class has the following lines:
/* Spring Integration context used to get desirable beans. */
AbstractApplicationContext context = new ClassPathXmlApplicationContext(
        new String[] { "client-config.xml" }, false);
context.refresh();
context.registerShutdownHook();

TcpClientService service = context.getBean("client", TcpClientService.class);
service.send("GIMMY");

http://pastebin.com/9mjmRyNk
In addition, define a client service:

void send(String txt);

Next, a message handler:

public void handle(byte[] s) {
    String ss = new String(s);
    LOG.info("r:" + ss);
}

http://pastebin.com/Wg4mscvk
And the last one is an interceptor, which will inform your application about:

i. Message sent;

ii. A connection closed;

iii. A new connection added.

public void send(Message<?> message) throws Exception {
    super.send(message);
    LOG.debug("Sent message [" + new String((byte[]) message.getPayload()) + "]");
}

public void close() {
    super.close();
    LOG.debug("Closed connection");
}

public void addNewConnection(TcpConnection connection) {
    super.addNewConnection(connection);
    LOG.debug("Added new connection " + connection.getHostName() + ":" + connection.getPort());
}

http://pastebin.com/wiDm5zbH

That’s it !!! 🙂

To play with the code, have a look here: http://www.4shared.com/zip/eF4q7l0k/spring_integration_example.html.

Prerequisites:

Java 1.6 or above;
Maven 3 or above;
Desire to learn something new & thrilling;

Pros:

A lot of features
Tested
Good & friendly community
If you have questions, the people really quickly reply
There are tons of examples
API is easy & comprehensive

Cons:

It takes time to learn and understand how to work with it.
If you run into trouble, it is sometimes difficult to debug.

Peace be upon you.

Categories: tips and tricks, tutorials Tags: bi-directional, spring integration, tcp

Tika chm extractor – LGPL alternative

June 5, 2013 misteroleg

Tika chm extractor

I’m pleased to announce that the LGPL-licensed Tika chm extractor was released yesterday. Honestly, it’s not pure LGPL: only the libraries it depends on are LGPL, while the rest of the code is under the Apache License, version 2.0.

All relevant information can be found here.
To download the sources, go to GitHub.

Why should it live?
Well, the “original” Tika extraction algorithm works pretty well in most cases, but it has “difficulties” in rare ones. For some unknown reason the inventors of compressed HTML files never published their specification, so the algorithm that the Tika chm parser uses for extracting content is not perfect, but quite good.
A possible solution that has crossed everybody’s mind is to use native libraries. Fair enough. The only question is how to make that work on multiple platforms. Aha! Having checked the available options, I found a stable Java library called sevenzipjbind.

The extractor is designed as a stand-alone program, i.e. a server based on Jetty that listens for HTTP requests. It currently has three options: i. extract a single file including its metadata; ii. extract content and metadata from all files in the provided directory; iii. extract only the metadata from a single chm.
In addition, it saves the extracted content and its metadata in a special folder following the pattern: ../extracted_files/folder_name_as_file_name/extracted html files. Metadata goes under ../extracted_files/file_name.json

Examples of how to use it can also be found on GitHub.

Please don’t hesitate to ask, either by replying to this post, by contacting me, or by sending a tweet!

Categories: announcement Tags: alternative, chm, extractor, tika

OCR using Tesseract and ImageMagick as pre-processing task

December 19, 2012 misteroleg

While many applications today use direct data entry via keyboard, more and more of these will return to automated data entry. The reasons for this include the increased incidence of operator wrist problems from constant keying and the potential hazards of video display terminal emissions. Therefore any application imaginable is a candidate for OCR.

What are its Applications?

Automatic number plate recognition is used by various police forces, as a method of electronic toll collection on pay-per-use roads, in parking, at car washing stations etc., and for cataloging the movements of traffic or individuals (quite popular in Central London).
Book scanning – digital books can be easily distributed, reproduced and read on-screen. Projects like Project Gutenberg, Google Book Search scan books on a large scale.
CAPTCHA – a type of challenge-response test used in computing in an attempt to ensure that the response is not generated by a computer. Stands for Completely Automated Public Turing test to tell Computers and Humans Apart.
Computational linguistics – machine translation
Digital pen as well as digital paper
Digital mail room is an automation of incoming mail processes for classification and distribution of mail.
Handwriting – a person’s particular and individual style of writing with pen or pencil. Every literate human has their own manner of writing. Graphology is the controversial study and analysis of handwriting, especially in relation to human psychology. Sometimes it is part of the hiring process: the candidate is asked to write by hand about a familiar topic, and the text is then sent to the experts for a psychological analysis of the person.
Music OCR – intended to interpret sheet music or printed scores into editable and playable form.
Optical Mark Recognition – is a process of capturing human marked data from document forms such as surveys and tests.
Kurzweil – text-to-speech converter software, which enables a computer to read electronic and scanned text aloud to visually impaired people.

Principles of OCR Technology

Optical Character Recognition (OCR) systems may recognize machine print. Using pattern-matching technology, OCR translates the shapes and patterns of machine-made characters into corresponding computer codes. Though most advanced systems are able to recognize multiple fonts, they can process only standard fonts such as Times Roman and Arial. Once all characters in a given word are recognized, the word is compared against a vocabulary of potential answers for the final result.

Character recognition then segments lines of text or words into separate characters that are recognized by the makeup of their component shapes. Machine-printed letters are evenly spaced across, and up-and-down, a given page, allowing the OCR system to read the text one character at a time. Segmentation into single characters represents a critical recognition failure point for forms processing organizations, because OCR recognition technology requires high-quality images with excellent contrast, character and clarity. Any text that is less than perfect will cause even the most sophisticated OCR systems to return significant reductions in accuracy when processing degraded images.
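The final step mentioned above, comparing a recognized word against a vocabulary of potential answers, can be sketched in a few lines of Java. I picked edit (Levenshtein) distance as the similarity measure purely for illustration; this is my assumption and not necessarily what any particular OCR engine does. The names VocabularyMatcher, editDistance and closest are made up.

```java
/** Picks the vocabulary word closest to an OCR result; edit distance is an assumed measure. */
class VocabularyMatcher {

    /** Classic Levenshtein distance: minimal number of insertions, deletions and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary entry with the smallest distance, or null for an empty vocabulary. */
    static String closest(String recognized, String[] vocabulary) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String word : vocabulary) {
            int distance = editDistance(recognized, word);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = word;
            }
        }
        return best;
    }
}
```

For instance, closest("descrbe", new String[] {"describe", "prescribe", "desire"}) picks "describe", because only one missing letter separates the two.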

How to choose an optimal product?

When discussing which OCR product to choose, a number of criteria should be considered. What price are you ready to pay? What is the quality of the product? How well is it supported? And so on. Fortunately for us, such a product exists. It is open source, of very good quality, pretty well supported and still alive. It is called tesseract-ocr. Why tesseract? Because it is open source, it is licensed under the Apache License 2.0, it is one of the best, it is well supported via a mailing list, it runs on multiple platforms, it has a wide range of built-in languages, and it is stable and integrates easily with other systems.

This tutorial is divided into:

Introduction to tesseract-ocr

Installation of tesseract 3.0.1 for Windows.

Extracting the text

Writing simple tesseract function using baseapi

Writing Java function that extracts text from given image using ProcessBuilder and tesseract.exe

Introduction to ImageMagick

Installation of ImageMagick 6.6.9-8 for Windows

Checking the installation

Brief description what’s under the hood, useful command line utilities.

Java API to ImageMagick (http://im4java.sourceforge.net/)

Introduction to MSL

Writing simple MSL script

Tips

Conclusion

Bibliography

Introduction to tesseract-ocr

As Wikipedia explains, in geometry, the tesseract, also called an 8-cell or regular octachoron or cubic prism, is the four-dimensional analog of the cube. The tesseract is to the cube as the cube is to the square. Just as the surface of the cube consists of 6 square faces, the hypersurface of the tesseract consists of 8 cubical cells. The tesseract is one of the six convex regular 4-polytopes.

In our case, the Tesseract OCR engine was one of the top 3 engines in the 1995 UNLV Accuracy test. It was originally developed at Hewlett Packard Laboratories Bristol and at Hewlett Packard Co, Greeley Colorado. Between 1995 and 2006 it had little work done on it, but it is probably one of the most accurate open source OCR engines available. The source code will read a binary, grey or color image and output text. Now Google takes care of it.

Tesseract Installation

Throughout this tutorial we will use a Windows box with Microsoft Visual Studio 2008 Express installed.

The installation is very simple and takes about 1 hour. You can use the Ant script provided to run particular tasks, or do it yourself.

Let’s meet the tutorial requirements.

Install Microsoft Visual Studio 2008 Express (http://msdn.microsoft.com/en-us/express/future/bb421473)
Add vcbuild.exe to the PATH
Install Ant (http://ant.apache.org/bindownload.cgi)
Install SVN client (http://subversion.apache.org/packages.html)
Check the Java2SE 1.5/6 installation

Now we are ready to step into the world of image processing.

Step 1.

Download the tesseract source files and data. You have two options: 1. using svn, or 2. using the Ant script provided.

If you have chosen to use Ant, check the following properties first.

tesseract.dir – a path where the tesseract sources will be downloaded
tesseract.dir.name – a folder name, i.e. ${tesseract.dir}/${tesseract.dir.name}

Just make sure it exists, or make it yourself. mkdir ….

Type:

ant svn

Ok, time to go drink a coffee or read the news.

Well, continuing with the Ant script, type:

ant build

If all went well, you will be notified that all 60 projects were built successfully.

Tesseract ships with the following list of trained languages:

Arabic
Bulgarian
Catalan
Czech
Chinese simplified
Chinese traditional
Danish
German
Greek
English
Finnish
French
Hebrew
Hindi
Croatian
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Lithuanian
Dutch
Norwegian
And more

Let’s see what we have inside.

tesseract – extracts text or characters from the image.
Usage: tesseract imagename outputfile [-l lang] [-psm pagesegmode] [configfile]
-l, -psm and configfile are optional. -l sets the language as an ISO 639-3 code (eng, rus, ell etc.). -psm sets the page segmentation mode; the following modes are available:

psm mode	Description
0	Orientation and script detection (OSD) only
1	Automatic page segmentation with OSD
2	Automatic page segmentation, but no OSD, or OCR
3	Fully automatic page segmentation, but no OSD. (Default)
4	Assume a single column of text of variable sizes
5	Assume a single uniform block of vertically aligned text
6	Assume a single uniform block of text
7	Treat the image as a single text line
8	Treat the image as a single word
9	Treat the image as a single word in a circle
10	Treat the image as a single character
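When tesseract is called from Java, the page segmentation mode is just one more pair of command line arguments. Here is a small sketch that assembles such a command line; it assumes the executable is reachable as tesseract on the PATH, and the names TesseractCommand and build as well as the file names in the usage note are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Assembles a tesseract command line; assumes "tesseract" is on the PATH. */
class TesseractCommand {

    /**
     * @param image      path to the input image
     * @param outputBase output file name without the .txt extension
     * @param lang       ISO 639-3 language code, for instance eng
     * @param psm        page segmentation mode from the table above (0..10)
     */
    static List<String> build(String image, String outputBase, String lang, int psm) {
        if (psm < 0 || psm > 10) {
            throw new IllegalArgumentException("psm must be between 0 and 10, got " + psm);
        }
        List<String> command = new ArrayList<String>();
        command.add("tesseract");
        command.add(image);
        command.add(outputBase);
        command.add("-l");
        command.add(lang);
        command.add("-psm");
        command.add(String.valueOf(psm));
        return command;
    }
}
```

For a scan that contains a single line of text, build("line.tif", "out", "eng", 7) yields the arguments for mode 7, and the resulting list can be handed to new ProcessBuilder(command).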

cntraining – generates a normproto and pffmtable. Reads in a text file consisting of feature samples from a training page in the following format: FontName CharName NumberOfFeatureTypes(N). It then appends these samples into a separate file for each character. The name of file is: DirectoryName/FontName/CharName.FeatureTypeName. The DirectoryName can be specified via a command line argument. If not specified, it defaults to the current directory.

combine_tessdata – creates a unified traineddata file from different files produced by the training process.

Usage	Description
language_data_path_prefix (e.g. tessdata/eng.)	Combines all individual tessdata components (unicharset, DAWGs, classifier templates, ambiguities, language configs). The result will be a combined tessdata file lang_code.traineddata
-e	Extracts individual components from a combined trained data file. For instance: combine_tessdata -e tessdata/ell.traineddata
-o	Overwrites individual components of the given lang_code.traineddata file. For instance: combine_tessdata -o tessdata/ell.traineddata
-u	Unpacks all the components to the specified path. For instance: combine_tessdata -u tessdata/ell.traineddata /home/$USER/temp/ell

mftraining – Separates training pages into files for each character. Strips from the files only the features and their parameters of the feature type mf. Reads in a text file consisting of feature samples from a training page in the following format: FontName CharName NumberOfFeatureTypes(N). The result is a binary file used by the OCR engine.
unicharset_extractor – Extracts a character/ligature set. Given a list of box files on the command line, generates a file containing a unicharset, a list of all the characters. The file contains the size of the set on the first line, and then one unichar per line.
Usage: unicharset_extractor [-D DIRECTORY] FILE…
wordlist2dawg – Generates a DAWG from a word list file. Given a file that contains a list of words (one word per line), generates the corresponding squished DAWG file.
Usage: wordlist2dawg [-t | -l min_len max_len] word_list_file dawg_file unicharset_file

Often, people think that with OCR they can “crack” CAPTCHAs.

As an example, run the following:

tesseract.exe ..\kor_data\gotcha.tif gotchaOutput.txt -l eng

For a human being it’s easy to recognize what’s written (rondity describe.); however, look at the output:

rmdwdescrbe.

It could not recognize the first word or the white space. Only the second word was recognized almost perfectly. You can train your OCR to cope with words like the first one, but whoever produces such CAPTCHAs will change their algorithm and you will fail again. In this case, don’t try harder.

Another example:

tesseract.exe ..\kor_data\fra.arial.g4.tif ..\kor_data\fra_output.txt -l fra

Looking at the output, you will probably find that the extracted text is quite good but not perfect. Some characters were misrecognized. To fix that you need to “add” these characters to the traineddata. This process is well described in the tesseract-ocr wiki (http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3).

In addition to batch processing, tesseract-ocr lets you integrate its capabilities into your program/product through a basic C++ API. It is well documented and easy to use. All baseapi sources are located in the ../api folder.

Here is an example:

#include "baseapi.h"

// Returns the recognized text, or NULL if tesseract could not be initialized.
// The caller owns the returned buffer and must release it with delete[].
char* run_tesseract(const char* datapath, const char* language,
                    const unsigned char* imagedata,
                    int bytes_per_pixel, int bytes_per_line,
                    int left, int top, int width, int height) {
  tesseract::TessBaseAPI api;

  // Starts tesseract. datapath must be the parent directory of tessdata
  // and must end in '/'. Init returns 0 on success.
  if (api.Init(datapath, language) != 0) {
    return NULL;
  }

  // Recognizes a rectangle from an image and returns the result as a string.
  char* text = api.TesseractRect(imagedata, bytes_per_pixel, bytes_per_line,
                                 left, top, width, height);

  // Closes down tesseract and frees up all memory.
  api.End();

  return text;
}

Java code using ProcessBuilder looks like:

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * Returns a text extracted from image
 *
 * @param image the image file; might be tiff, png or jpeg
 * @param tesseractPath where the tesseract executable is located
 * @param iso639_3Lang three-character language code, for instance, fra
 * @return extracted text
 * @throws IOException
 * @throws InterruptedException
 */
public static String getExtractedText(File image, String tesseractPath,
        String iso639_3Lang)
        throws IOException, InterruptedException {
    File outputFile = new File(image.getParentFile(), "output");
    StringBuilder buffer = new StringBuilder();

    ProcessBuilder pb = new ProcessBuilder(
            tesseractPath + File.separator + "tesseract",
            image.getCanonicalPath(),
            outputFile.getAbsolutePath(),
            "-l", iso639_3Lang);
    pb.redirectErrorStream(true);
    Process process = pb.start();

    // Drain the merged stdout/stderr so the child cannot block on a full pipe.
    InputStream console = process.getInputStream();
    while (console.read() != -1) {
        // the console output is discarded
    }
    process.waitFor();

    // tesseract appends ".txt" to the output base name
    File textFile = new File(outputFile.getAbsolutePath() + ".txt");
    BufferedReader in = new BufferedReader(new InputStreamReader(
            new FileInputStream(textFile), "UTF-8"));
    try {
        String str;
        while ((str = in.readLine()) != null) {
            buffer.append(str).append(System.getProperty("line.separator"));
        }
    } finally {
        in.close();
    }
    textFile.delete();
    return buffer.toString();
}

When working with OCR, you will often want to prepare your data (images) before handing it to the OCR engine. That could mean converting the image format, increasing or decreasing the image resolution, or reducing image noise. There are a lot of options for that; GIMP (http://www.gimp.org/) is free, mature and cute! If you are trying to automate the data preparation process, look at ImageMagick (http://www.imagemagick.technocozy.com/).
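As a taste of what such automation might look like from Java, here is a sketch that assembles an ImageMagick convert command line for a typical clean-up before OCR. The chosen steps (grayscale, despeckle, deskew with a 40% threshold) are only my assumption of a reasonable starting point and not a recipe from ImageMagick; the names PreprocessCommand and build and all file names are made up.

```java
import java.util.ArrayList;
import java.util.List;

/** Assembles an ImageMagick convert command line for OCR pre-processing (assumed steps). */
class PreprocessCommand {

    /**
     * @param convertExe path to the convert executable
     * @param input      image to clean up
     * @param output     where to write the result; the extension selects the format
     */
    static List<String> build(String convertExe, String input, String output) {
        List<String> command = new ArrayList<String>();
        command.add(convertExe);
        command.add(input);
        command.add("-colorspace"); // drop the color information
        command.add("Gray");
        command.add("-despeckle");  // reduce the speckles (noise)
        command.add("-deskew");     // straighten a slightly rotated scan
        command.add("40%");
        command.add(output);
        return command;
    }
}
```

The list can be passed to new ProcessBuilder(command) in the same way as the tesseract call above. Since the output extension selects the format, writing scan.tif from scan.png also covers the format conversion.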

ImageMagick

It can detect edges, add noise, capture a screen and much more. I cannot cover everything here, so I am going to cover only the parts relevant to our OCR processing.

Format conversion;
Transformations;
Composite – not sure …
Image identification
MSL – Magick Scripting Language – not sure

Installation of ImageMagick

IM supports a wide range of platforms, from *nix to Windows. I assume you have used Windows throughout this tutorial, so let it be so.

If you use the Ant script provided, run:

ant im.http

This command will download the Windows installer. The Ant properties are in the build.properties file; change them according to your set-up.

Moreover, the MAGICK_HOME environment variable should be set to the path where you previously extracted the ImageMagick files.

Verifying installation

convert logo: logo.miff
imdisplay logo.miff

ImageMagick core utilities

Utility name	Usage
Display	Intended to view an image, manage its functionality including load, print, write to file, zoom, copy a region, paste a region, crop, show histogram and even more.
Convert	Converts image formats. Can be used for making thumbnails, charcoal drawing, oil painting, morphing
Import	Used to capture the screen and writes it to the file. Can be specified a single window, the entire screen, or any portion of the screen
Animate	Shows animated formats or a sequence of images. Has a capability for color reduction to match the color resolution of the display.
Composite	Combines several separate images with the following schemes: Over, In, Out, Atop, Xor, Plus, Minus, Difference, Multiply and Bumpmap.
Montage	Arranges a group of images into a single image.
Mogrify	Applies transformations on images and, unlike other utilities, overwrites the original image with the result.
Conjure	Magick Scripting Language (MSL), XML-based language using Conjure to perform any image processing activity without Perl interpreter.
Identify	Detects more information about an image format, such as file size, width, height, mapped color, number of colors and can detect if an image is corrupted.

ImageMagick has an unbelievable number of interfaces; you can choose whichever you want. In this tutorial we will use the Java API, im4java (http://im4java.sourceforge.net/).

Convert usage, options and image operators

Usage: convert.exe [options …] file [ [options …] file …] [options …] file

Options – Image Settings:

-adjoin	joins images into a single multi-image file
-affine matrix	affine transform matrix
-alpha option	activates, deactivates, resets, or sets the alpha channel
-antialias	removes pixel-aliasing
-authenticate password	deciphers image with this password
-attenuate value	lessens (or intensifies) the noise added to an image
-background color	background color
-bias value	adds bias when convolving an image
-black-point-compensation	uses black point compensation
-blue-primary point	chromaticity blue primary point
-bordercolor color	border color
-caption string	assigns a caption to an image
-channel type	applies option to select image channels
-colors value	preferred number of colors in the image
-colorspace type	alternates image colorspace
-comment string	annotates image with comment
-compose operator	sets image composite operator
-compress type	type of pixel compression when writing the image
-define format:option	defines one or more image format options
-delay value	displays the next image after pausing
-density geometry	horizontal and vertical density of the image
-depth value	image depth
-direction type	renders text right-to-left or left-to-right
-display server	gets image or font from this X server
-dispose method	layers disposal method
-dither method	applies error diffusion to image
-encoding type	text encoding type
-endian type	endianness (MSB or LSB) of the image
-family name	renders text with this font family
-fill color	color to use when filling a graphic primitive
-filter type	uses this filter when resizing an image
-font name	renders text with this font
-format “string”	output formatted image characteristics
-fuzz distance	colors within this distance are considered equal
-gravity type	horizontal and vertical text placement
-green-primary point	chromaticity green primary point
-intent type	type of rendering intent when managing the image color
-interlace type	type of image interlacing scheme
-interline-spacing value	sets the space between two text lines
-interpolate method	pixel color interpolation method
-interword-spacing value	sets the space between two words
-kerning value	sets the space between two letters
-label string	assigns a label to an image
-limit type value	pixel cache resource limit
-loop iterations	adds Netscape loop extension to your GIF animation
-mask filename	associates a mask with the image
-mattecolor color	frame color
-monitor	Monitors progress
-orient type	image orientation
-page geometry	size and location of an image canvas (setting)
-ping	efficiently determines image attributes
-pointsize value	font point size
-precision value	maximum number of significant digits to print
-preview type	image preview type
-quality value	JPEG/MIFF/PNG compression level
-quiet	suppresses all warning messages
-red-primary point	chromaticity red primary point
-regard-warnings	Pays attention to warning messages
-remap filename	Transforms image colors to match this set of colors
-respect-parentheses	settings remain in effect until parenthesis boundary
-sampling-factor geometry	horizontal and vertical sampling factor
-scene value	image scene number
-seed value	Seeds a new sequence of pseudo-random numbers
-size geometry	width and height of image
-stretch type	renders text with this font stretch
-stroke color	graphic primitive stroke color
-strokewidth value	graphic primitive stroke width
-style type	Renders text with this font style
-synchronize	synchronize image to storage device
-taint	Declares the image as modified
-texture filename	name of texture to tile onto the image background
-tile-offset geometry	tiles offset
-treedepth value	color tree depth
-transparent-color color	transparent color
-undercolor color	annotation bounding box color
-units type	the units of image resolution
-verbose	prints detailed information about the image
-view	FlashPix viewing transforms
-virtual-pixel method	virtual pixel access method
-weight type	Renders text with this font weight
-white-point point	chromaticity white point

Image Operators:

-adaptive-blur geometry	adaptively blur pixels; decrease effect near edges
-adaptive-resize geometry	adaptively resizes image using ‘mesh’ interpolation
-alpha option	on, activate, off, deactivate, set, opaque, copy
-annotate geometry text	annotate the image with text
-auto-gamma	automagically adjusts gamma level of image
-auto-level	automagically adjusts color levels of image
-auto-orient	automagically orients (rotates) image
-bench iterations	Measures performance
-black-threshold value	forces all pixels below the threshold into black
-blue-shift factor	Simulates a scene at nighttime in the moonlight
-blur geometry	Reduces image noise and reduces detail levels
-border geometry	Surrounds image with a border of color
-bordercolor color	border color
-brightness-contrast geometry	improves brightness / contrast of the image
-cdl filename	color correct with a color decision list
-charcoal radius	Simulates a charcoal drawing
-chop geometry	Removes pixels from the image interior
-clamp	Restricts pixel range from 0 to the quantum depth
-clip	Clips along the first path from the 8BIM profile
-clip-mask filename	Associates a clip mask with the image
-clip-path id	Clips along a named path from the 8BIM profile
-colorize value	Colorizes the image with the fill color
-color-matrix matrix	Applies color correction to the image
-contrast	Enhances or reduces the image contrast
-contrast-stretch geometry	Improves contrast by ‘stretching’ the intensity range
-convolve coefficients	Applies a convolution kernel to the image
-cycle amount	Cycles the image colormap
-decipher filename	converts cipher pixels to plain pixels
-deskew threshold	straightens an image
-despeckle	Reduces the speckles within an image
-distort method args	distorts images according to the given method and args
-draw string	Annotates the image with a graphic primitive
-edge radius	Applies a filter to detect edges in the image
-encipher filename	Converts plain pixels to cipher pixels
-emboss radius	Embosses an image
-equalize	Performs histogram equalization to an image
-evaluate operator value	evaluates an arithmetic, relational, or logical expression
-extent geometry	Sets the image size
-extract geometry	Extracts area from image
-fft	implements the discrete Fourier transform (DFT)
-flip	Flips image vertically
-floodfill geometry color	Floodfills the image with color
-flop	Flops image horizontally
-frame geometry	Surrounds image with an ornamental border
-function name parameters	Applies function over image values
-gamma value	level of gamma correction
-gaussian-blur geometry	Reduces image noise and reduces detail levels
-geometry geometry	preferred size or location of the image
-identify	Identifies the format and characteristics of the image
-ift	implements the inverse discrete Fourier transform (DFT)
-implode amount	Implodes image pixels about the center
-lat geometry	local adaptive thresholding
-layers method	optimizes, merges, or compares image layers
-level value	Adjusts the level of image contrast
-level-colors color,color	Levels image with the given colors
-linear-stretch geometry	Improves contrast by ‘stretching with saturation’
-liquid-rescale geometry	Rescales image with seam-carving
-median geometry	Applies a median filter to the image
-mode geometry	Makes each pixel the ‘predominate color’ of the neighborhood
-modulate value	Varies the brightness, saturation, and hue
-monochrome	transforms image to black and white
-morphology method kernel	Applies a morphology method to the image
-motion-blur geometry	Simulates motion blur
-negate	Replaces every pixel with its complementary color
-noise geometry	adds or reduces noise in an image
-normalize	Transforms image to span the full range of colors
-opaque color	Changes this color to the fill color
-ordered-dither NxN	Adds a noise pattern to the image with specific amplitudes
-paint radius	Simulates an oil painting
-polaroid angle	Simulates a Polaroid picture
-posterize levels	Reduces the image to a limited number of color levels
-profile filename	adds, deletes, or applies an image profile
-quantize colorspace	Reduces colors in this colorspace
-radial-blur angle	radial blurs the image
-raise value	Lightens/darkens image edges to create a 3-D effect
-random-threshold low,high	random thresholds the image
-region geometry	Applies options to a portion of the image
-render	Renders vector graphics
-repage geometry	size and location of an image canvas
-resample geometry	Changes the resolution of an image
-resize geometry	Resizes the image
-roll geometry	Rolls an image vertically or horizontally
-rotate degrees	Applies Paeth rotation to the image
-sample geometry	Scales image with pixel sampling
-scale geometry	Scales the image
-segment values	Segments an image
-selective-blur geometry	selectively blurs pixels within a contrast threshold
-sepia-tone threshold	simulates a sepia-toned photo
-set property value	Sets an image property
-shade degrees	Shades the image using a distant light source
-shadow geometry	Simulates an image shadow
-sharpen geometry	Sharpens the image
-shave geometry	Shaves pixels from the image edges
-shear geometry	Slides one edge of the image along the X or Y axis
-sigmoidal-contrast geometry	Increases the contrast without saturating highlights or shadows
-sketch geometry	Simulates a pencil sketch
-solarize threshold	Negates all pixels above the threshold level
-sparse-color method args	fills in an image based on a few color points
-statistic type geometry	Replaces each pixel with corresponding statistic from the neighborhood
-strip	Strips image of all profiles and comments
-swirl degrees	Swirls image pixels about the center
-threshold value	Thresholds the image
-thumbnail geometry	Creates a thumbnail of the image
-tile filename	Tiles image when filling a graphic primitive
-tint value	Tints the image with the fill color
-transform	affine transforms image
-transpose	Flips image vertically and rotates 90 degrees
-transverse	Flops image horizontally and rotates 270 degrees
-trim	Trims image edges
-type type	image type
-unique-colors	Discards all but one of any pixel color
-unsharp geometry	Sharpens the image
-vignette geometry	Softens the edges of the image in vignette style
-wave geometry	Alters an image along a sine wave
-white-threshold value	forces all pixels above the threshold into white
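
To give a feel for how several of these operators combine, here is a minimal sketch that assembles a convert command line for cleaning up a scan before OCR. The class and method names are made up for illustration, the choice and order of operators is only one plausible starting point, and the 40% deskew threshold is an assumed value that you would tune for your own images.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds (but does not run) an ImageMagick convert command.
class OcrPreprocessCommand {

    // Returns the argument list for: convert <source> <operators...> <target>
    static List<String> build(String source, String target) {
        List<String> cmd = new ArrayList<>();
        cmd.add("convert");        // assumes ImageMagick's convert is first on the PATH
        cmd.add(source);
        cmd.add("-auto-orient");   // orient (rotate) the image
        cmd.add("-deskew");        // straighten the image
        cmd.add("40%");            // assumed threshold, tune for your scans
        cmd.add("-despeckle");     // reduce speckles
        cmd.add("-normalize");     // span the full range of colors
        cmd.add("-monochrome");    // transform to black and white
        cmd.add(target);
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("data/imade_art2.jpg", "data/imade_art2.tiff");
        System.out.println(String.join(" ", cmd));
        // To execute it for real (ImageMagick must be installed):
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

Keeping the command as a list of separate arguments means nothing has to be quoted or escaped when it is later handed to ProcessBuilder.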

Image Sequence Operators:

-append	Appends an image sequence
-clut	Applies a color lookup table to the image
-coalesce	Merges a sequence of images
-combine	Combines a sequence of images
-composite	Composites image
-crop geometry	Cuts out a rectangular region of the image
-deconstruct	Breaks down an image sequence into constituent parts
-evaluate-sequence operator	Evaluates an arithmetic, relational, or logical expression
-flatten	Flattens a sequence of images
-fx expression	Applies mathematical expression to an image channel(s)
-hald-clut	Applies a Hald color lookup table to the image
-morph value	Morphs an image sequence
-mosaic	Creates a mosaic from an image sequence
-print string	Interprets string and prints to console
-process arguments	Processes the image with a custom image filter
-separate	Separates an image channel into a grayscale image
-smush geometry	Smashes an image sequence together
-write filename	Writes images to this file

Image Stack Operators:

-clone indexes	Clones an image
-delete indexes	Deletes the image from the image sequence
-duplicate count,indexes	Duplicates an image one or more times
-insert index	Inserts last image into the image sequence
-reverse	Reverses image sequence
-swap indexes	Swaps two images in the image sequence
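
The stack operators are easiest to understand together with parentheses. The following sketch (class and method names are hypothetical) builds a command that clones the input image inside parentheses, despeckles only the clone, and then appends the original and the cleaned copy into one comparison image. I am assuming here that such a before/after picture is useful when tuning pre-processing; it is not something the option list itself prescribes.

```java
import java.util.List;

// Hypothetical helper: builds (but does not run) a before/after comparison command.
class StackCompareCommand {

    // convert <source> ( -clone 0 -despeckle ) -append <target>
    static List<String> build(String source, String target) {
        return List.of(
                "convert", source,
                "(",               // start a separate image list
                "-clone", "0",     // copy image 0 (the source) into it
                "-despeckle",      // applies to the clone only
                ")",               // push the result back onto the main sequence
                "-append",         // append the image sequence into one image
                target);
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", build("scan.jpg", "compare.png")));
    }
}
```

When typing the same command into a Unix shell, the parentheses have to be escaped. Passed as separate arguments from Java, they need no escaping.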

Miscellaneous Options:

-debug events	Displays copious debugging information
-help	Prints program options
-list type	Prints a list of supported option arguments
-log format	Formats of debugging information
-version	Prints version information

Here is Java code that converts an image from JPEG to TIFF.

    import java.io.IOException;

    import org.im4java.core.ConvertCmd;
    import org.im4java.core.IM4JavaException;
    import org.im4java.core.IMOperation;
    import org.im4java.core.IdentifyCmd;

    public class IMConvertCmd {

        public static void main(String[] args)
                throws IOException, InterruptedException, IM4JavaException {
            String searchPath = "E:/image_magick";
            String sourceImage = "data/imade_art2.jpg";
            String destImage = "data/imade_art2.tiff";
            IMConvertCmd.tryExample(searchPath, sourceImage, destImage);
        }

        /**
         * Converts the source image with convert, then runs identify with
         * the verbose option on the result.
         *
         * @param searchPath  where the ImageMagick executables are placed
         * @param sourceImage the source image
         * @param destImage   the destination image to be written
         * @throws IOException
         * @throws InterruptedException
         * @throws IM4JavaException
         */
        public static void tryExample(String searchPath, String sourceImage,
                String destImage)
                throws IOException, InterruptedException, IM4JavaException {
            // convert <sourceImage> <destImage>; the output format follows
            // from the file extension of destImage
            ConvertCmd convertCmd = new ConvertCmd();
            convertCmd.setSearchPath(searchPath);
            IMOperation convertOp = new IMOperation();
            convertOp.addImage(sourceImage);
            convertOp.addImage(destImage);
            convertCmd.run(convertOp);

            // identify -verbose <destImage>
            IdentifyCmd identifyCmd = new IdentifyCmd();
            identifyCmd.setSearchPath(searchPath);
            IMOperation identifyOp = new IMOperation();
            identifyOp.verbose();
            identifyOp.addImage(destImage);
            identifyCmd.run(identifyOp);
        }
    }

There is another cool thing called MSL. It stands for Magick Scripting Language, an XML-based language intended for those who want to accomplish custom image processing tasks without programming. The interpreter is called conjure. A script looks like a typical XML file with specialized tags in it and has the file extension .msl.

An example of MSL:

    <?xml version="1.0" encoding="UTF-8"?>
    <image size="116x28" >
      <read filename="imade_art2.jpg" />
      <get width="base-width" height="base-height" />
      <resize geometry="%[dimensions]" />
      <get width="width" height="height" />
      <print output=
        "Image sized from %[base-width]x%[base-height]
         to %[width]x%[height].\n" />
      <write filename="imade_art2.png" />
    </image>

To invoke this script:

conjure -dimensions 116x28 firstMSL.msl
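
If you drive ImageMagick from Java anyway, the script itself can be generated. Below is a minimal sketch, assuming a much smaller script than the one above (read, resize, write) and file names without XML special characters. The class and method names are hypothetical. It writes the script to a temporary file and prints the matching conjure command.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper: generates a small MSL script and the conjure call for it.
class MslScriptWriter {

    // The %[dimensions] placeholder is filled from the -dimensions argument of conjure.
    static String resizeScript(String source, String target) {
        return String.join("\n",
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
                "<image>",
                "  <read filename=\"" + source + "\" />",
                "  <resize geometry=\"%[dimensions]\" />",
                "  <write filename=\"" + target + "\" />",
                "</image>",
                "");
    }

    // conjure -dimensions <dimensions> <script>
    static List<String> conjureCommand(String dimensions, Path script) {
        return List.of("conjure", "-dimensions", dimensions, script.toString());
    }

    public static void main(String[] args) throws IOException {
        Path script = Files.createTempFile("resize", ".msl");
        String msl = resizeScript("imade_art2.jpg", "imade_art2.png");
        Files.write(script, msl.getBytes(StandardCharsets.UTF_8));
        System.out.println(String.join(" ", conjureCommand("116x28", script)));
    }
}
```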

Magick Scripting Language (MSL) defines the following elements and their attributes:

tag/element	Attribute description/option(s)
<image>	Defines a new image object; the closing </image> destroys it.
<group>	Defines a new group of image objects. By default, images are only valid for the life of their <image> element. However, in a group, all images in that group will stay around for the life of the group.
<read>	Reads a new image from the disk.
<write>	Writes the image(s) to disk, as a single file or as multiple files if necessary.
<get>	Gets any recognized attribute and stores it as an image attribute for later use. Currently only width and height are supported.
<set>	Sets background, bordercolor, clip-mask, colorspace, density, magick, mattecolor and opacity.
<border>	Surrounds the image with a border color. Options: fill, geometry, height, width
<blur>	Reduces image noise and reduces detail levels. Options: radius, sigma
<charcoal>	Simulates a charcoal drawing. Options: radius, sigma
<chop>	Removes pixels from the interior of an image. Options: geometry, height, width, x, y
<crop>	Cuts out one or more rectangular regions of the image. Options: geometry, height, width, x, y
<despeckle>	Removes “pepper” from an image
<emboss>	Replaces each pixel of an image by a highlight or a shadow, depending on light/dark boundaries on the original image.
<enhance>	Removes blurring and noise, increases contrast and reveals details.
<equalize>	Applies a histogram equalization to the image
<flip>	Creates a mirror image, reflecting the scanlines in the vertical direction.
<flop>	Creates a mirror image, reflecting the scanlines in the horizontal direction.
<frame>	Surrounds the image with a border or beveled frame. Options: fill, geometry, height, width, x, y, inner, outer
<magnify>	Scales the image to twice its size
<minify>	Scales the image to half its size
<normalize>	Enhances the contrast of a color image
<resize>	Resizes an image. Options: blur, filter, geometry, height, width
<roll>	Rolls an image vertically or horizontally. Options: geometry, x, y
<rotate>	Applies Paeth image rotation. Options: degrees
<sample>	Changes the image size simply by directly sampling the pixels of original image. Options: geometry, height, width
<scale>	Changes the image size by replacing pixels by averaging pixels together when minifying or replacing pixels when magnifying. Options: geometry, height, width
<sharpen>	Uses a Gaussian operator of the given radius and standard deviation (sigma). Options: radius, sigma
<shave>	Removes pixels from the image edges. Options: geometry, height, width
<solarize>	Negates all pixels above the threshold level. Options: threshold
<spread>	Displaces image pixels by a random amount. Options: radius
<stegano>	Hides watermark within an image. Options: image
<stereo>	Generates stereogram of two images (one for each eye). Options: image
<swirl>	Swirls image pixels about the center. Options: degrees
<texture>	Tiles texture onto the image background. Options: image
<threshold>	Applies simultaneous black/white threshold to the image. Options: threshold
<transparent>	Makes [this] color transparent within the image. Options: color
<trim>	Removes any edges that are exactly the same color as the corner pixels.

This short tutorial cannot cover all the ImageMagick utilities, so you are welcome to check them out yourself.

Tips

While using Tesseract I made some wrong decisions. One of them was using Cygwin. I wasted about two days trying to compile the sources, adding more and more missing libraries and recompiling again. Finally, I got my executables, but some of the features still did not work. Removing the whole Cygwin “mess” and installing Microsoft Visual Studio 2008 Express solved my troubles. It took only about two hours, including installing Visual Studio and compiling the entire solution. It worked like a charm!
The next challenge was installing ImageMagick. I thought that with Visual Studio 2008 installed I would have no problems compiling the sources. I was wrong again. ImageMagick depends on the MFC library, which is not included in Visual Studio 2008 Express, so it cannot be compiled with that edition. The other option was to install the ImageMagick binary distribution, and that worked. Alternatively, you can install Visual Studio 6. It’s up to you.
Before using Tesseract, I encourage you to read its FAQ and wiki. If you have questions, subscribe to the Tesseract mailing list. There are excellent people there who can help you. Unlike opening support tickets and waiting days or even months, help comes very quickly here.

Conclusion

OCR usually consists of two stages. In the first stage we prepare the data to be processed. Some images are noisy, others are poorly scanned, or their format does not fit our purposes. ImageMagick helps us with this kind of preparation and lets us create scripts that automate the process. In the second stage we do the actual OCR. Tesseract has a base API (baseapi) that makes it easier to integrate its capabilities into your environment.
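
The two stages can be chained from Java with plain ProcessBuilder calls. This is only a sketch under a few assumptions: both convert and tesseract are on the PATH, the pre-processing operators are placeholders for whatever your images need, and the class and method names are invented for the example. It uses the command-line tools rather than Tesseract’s baseapi.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical two-stage pipeline: ImageMagick pre-processing, then Tesseract OCR.
class OcrPipeline {

    // Stage 1 cleans the scan; stage 2 makes Tesseract write <outputBase>.txt.
    static List<List<String>> stages(String scan, String cleaned, String outputBase) {
        return List.of(
                List.of("convert", scan, "-despeckle", "-monochrome", cleaned),
                List.of("tesseract", cleaned, outputBase));
    }

    // Runs the stages in order and stops at the first one that fails.
    static void run(List<List<String>> stages) throws IOException, InterruptedException {
        for (List<String> stage : stages) {
            int exitCode = new ProcessBuilder(stage).inheritIO().start().waitFor();
            if (exitCode != 0) {
                throw new IOException("Stage failed with exit code " + exitCode
                        + ": " + String.join(" ", stage));
            }
        }
    }

    public static void main(String[] args) {
        // Print the plan only; call run(...) once both tools are installed.
        for (List<String> stage : stages("scan.jpg", "scan.tiff", "scan")) {
            System.out.println(String.join(" ", stage));
        }
    }
}
```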

When building your system, keep OCR’s limitations in mind.

If you have comments or suggestions, please share them with me and other people.

Have fun!

Categories: tutorials Tags: imagemagic, ocr using tesseract, tesseract

previous postings archive

December 12, 2012 misteroleg Leave a comment

previous postings archive

Here is some stuff that is a little bit outdated. Just for the record.

Categories: Uncategorized Tags: eclipse, icefaces, java tips
