Document Decomposition of Bangla Printed Text
Today all kind of information is getting digitized and along with all this digitization, the huge archive of various kinds of documents is being digitized too. We know that, Optical Character Recognition is the method through which, newspapers and other paper documents convert into digital resources. But, it is a fact that this method works on texts only. As a result, if we try to process any document which contains non-textual zones, then we will get garbage texts as output. That is why; in order to digitize documents properly they should be prepossessed carefully. And while preprocessing, segmenting document in different regions according to the category properly is most important. But, the Optical Character Recognition processes available for Bangla language have no such algorithm that can categorize a newspaper/book page fully. So we worked to decompose a document into its several parts like headlines, sub headlines, columns, images etc. And if the input is skewed and rotated, then the input was also deskewed and de-rotated. To decompose any Bangla document we found out the edges of the input image. Then we find out the horizontal and vertical area of every pixel where it lies in. Later on the input image was cut according to these areas. Then we pick each and every sub image and found out their height-width ratio, line height. Then according to these values the sub images were categorized. To deskew the image we found out the skew angle and de skewed the image according to this angle. To de-rotate the image we used the line height, matra line, pixel ratio of matra line.
READ FULL TEXT