
Notes:


The future architecture of Cloudooo will introduce three new components: granulator, classifier and normalizer.

The purpose of a granulator is to extract sub-content from larger content. For example, a PDF file contains paragraphs, images and tables. Some of this content is explicitly defined in the PDF through typesetting instructions, so a good deal of structured content can be extracted from a PDF file directly. Images may be converted to text through OCR (this is the role of the conversion Handler). Sophisticated OCR engines can also extract tables from the text content of an image. This is useful, for example, to extract the invoice lines from a scanned invoice.
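As a rough illustration of the idea, the sketch below granulates a document into paragraphs and image references. It assumes the document has already been converted to HTML, and the names `ToyGranulator` and `granulate` are hypothetical; they are not part of Cloudooo.

```python
from html.parser import HTMLParser


class ToyGranulator(HTMLParser):
    """Hypothetical granulator: collects paragraphs and image references."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []        # text of each <p> element
        self.images = []            # src attribute of each <img> element
        self._in_paragraph = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            self.paragraphs.append("".join(self._buffer).strip())
            self._in_paragraph = False

    def handle_data(self, data):
        # Only text inside a paragraph is kept as a granule.
        if self._in_paragraph:
            self._buffer.append(data)


def granulate(html):
    """Return the granules found in an HTML string, grouped by kind."""
    parser = ToyGranulator()
    parser.feed(html)
    parser.close()
    return {"paragraphs": parser.paragraphs, "images": parser.images}
```

A real granulator would also have to handle tables, nested structures and formats other than HTML; this sketch only shows the shape of the input and output.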

The purpose of a classifier is to extract implicit metadata from a file. Implicit metadata can be the language (if it is not explicitly defined), the type of content (e.g. poetry, marketing), the emotion displayed in a picture (happy, sad), etc.
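A toy example of extracting one piece of implicit metadata, the language, could look like the following. It is only a sketch: the stop-word lists are tiny and chosen for illustration, the function name `classify_language` is hypothetical, and a real classifier would rely on a proper language-detection model.

```python
import re

# Illustrative stop-word lists; far too small for real use.
STOPWORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "fr": {"le", "la", "et", "les", "des", "est"},
    "de": {"der", "die", "und", "das", "ist", "nicht"},
}


def classify_language(text, declared=None):
    """Guess the language of text unless it is already declared."""
    if declared:
        # Explicit metadata wins; nothing to classify.
        return declared
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(word in stopwords for word in words)
        for lang, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Return None when no stop word matched at all.
    return best if scores[best] > 0 else None
```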

The purpose of a normalizer is to find a common vocabulary for the metadata extracted from file content. For example, normalizing column names can help find equivalent columns from one table to another.
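The column-name case can be sketched as follows. The synonym table and the function names `normalize_column` and `equivalent_columns` are assumptions made for this example, not an existing Cloudooo interface.

```python
import re

# Hypothetical mapping from cleaned-up names to canonical names.
SYNONYMS = {
    "qty": "quantity",
    "quantity": "quantity",
    "unit price": "unit_price",
    "price per unit": "unit_price",
    "item": "description",
    "description": "description",
}


def normalize_column(name):
    """Map a raw column name to a canonical name."""
    # Lowercase and collapse punctuation and whitespace to single spaces.
    key = re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()
    return SYNONYMS.get(key, key.replace(" ", "_"))


def equivalent_columns(columns_a, columns_b):
    """Pair up columns of two tables that share a canonical name."""
    norm_a = {normalize_column(c): c for c in columns_a}
    norm_b = {normalize_column(c): c for c in columns_b}
    return {norm_a[k]: norm_b[k] for k in norm_a if k in norm_b}
```

For instance, `equivalent_columns(["Qty.", "Unit Price", "VAT"], ["Quantity", "Price per unit", "Total"])` pairs "Qty." with "Quantity" and "Unit Price" with "Price per unit", and leaves the unmatched columns out.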

All three components may be used in combination with the conversion Handler. A preliminary conversion to a base format may be done automatically before calling the appropriate granulator component. Also, certain outputs of granulation (e.g. images) may need additional conversion or a change of resolution. The same applies to the classifier and the normalizer in terms of input format.
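The automatic preliminary conversion could be wired up along these lines. Everything here is hypothetical: `convert` is a stand-in for the conversion Handler that only relabels the format, HTML is assumed as the base format, and the granulator simply splits text on blank lines.

```python
BASE_FORMAT = "html"  # assumed base format for this sketch


def convert(document, target_format):
    """Stand-in for the conversion Handler: relabels the format only."""
    return {"format": target_format, "payload": document["payload"]}


def granulate_base(document):
    """Toy granulator: one granule per blank-line-separated block."""
    blocks = document["payload"].split("\n\n")
    return [
        {"kind": "paragraph", "payload": block.strip()}
        for block in blocks
        if block.strip()
    ]


# Registry of granulators keyed by the input format they accept.
GRANULATORS = {BASE_FORMAT: granulate_base}


def process(document):
    """Convert to the base format if needed, then granulate."""
    steps = []
    if document["format"] not in GRANULATORS:
        document = convert(document, BASE_FORMAT)
        steps.append("convert")
    granules = GRANULATORS[document["format"]](document)
    steps.append("granulate")
    return granules, steps
```

The same dispatch pattern would apply to a classifier or a normalizer: look up the component by input format and insert a conversion step only when no component accepts the format as it stands.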