Multilingual Automatic Document Classification, Analysis and Translation (MADCAT)

Summary

The goal of the Multilingual Automatic Document Classification, Analysis and Translation (MADCAT) program is to automatically convert foreign-language text images into English transcripts, thus eliminating the need for linguists and analysts while automatically providing relevant, distilled and actionable information to military command and personnel in a timely fashion.

Warfighters encounter foreign language images in many forms, including captured paper documents and computer files. Given the quantity of foreign-language material and the scarcity of linguists, military personnel and analysts can find it difficult to identify, translate and interpret important information in a timely fashion. What these personnel and analysts have lacked to date is the capability to automatically and rapidly convert foreign-language text images into English transcripts that provide relevant, distilled and actionable information.

To help overcome these challenges, DARPA’s Multilingual Automatic Document Classification, Analysis and Translation (MADCAT) program has developed capabilities to automatically classify, analyze and interpret foreign-language text in document images. MADCAT technologies first recognize text in scanned hard copy or imaged scenes (e.g., signs, graffiti or televised text), then translate the text into English for use by English-speaking military personnel and analysts.

MADCAT technologies are able to:

Analyze images to determine language and type of script
Classify images to determine the kind of material that each image presents (photo, newspaper article, technical memo, ledger, etc.)
Segment images and interpret different text zones, including classification and parsing of tables
Produce source-language transcripts of the printed or handwritten text in images
Produce accurate English translations of source-language text

The program has developed optical character recognition and machine translation capabilities for 11 languages: Arabic, Chinese, Dari, Farsi, Hindi, Pashto, Spanish, Russian, Thai, Urdu and Korean.

MADCAT technologies are script- and language-independent, requiring only transcribed training documents. DARPA is working to transition MADCAT technologies into systems to enable English-speaking government and military personnel to read hard copies of foreign-language documents. The program has also started a project to further develop specialized Korean optical character recognition and machine translation systems in support of military user requirements.

Office

Information Processing Techniques Office

This program is now complete

This content is available for reference purposes. This page is no longer maintained.
