Crowdsourcing + Image Matching

For my master’s report, I developed a workflow for transcribing large collections of scanned documents. The process begins by segmenting each scanned page into individual images of words (see Figure 1).

segmentedPage
Figure 1
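
The segmentation method itself is not described in this post, so the following is only a minimal sketch of one common approach, not necessarily the one used in the report. It assumes a binarized text line (1 = ink, 0 = background) and a hypothetical `min_gap` parameter: columns without ink are treated as whitespace, and a run of at least `min_gap` empty columns separates two words.

```python
def split_line_into_words(line, min_gap=3):
    """Return (start, end) column ranges of words in a binarized text line.

    `line` is a list of rows, each a list of 0/1 pixels (1 = ink).
    `min_gap` is a hypothetical threshold: the number of consecutive
    empty columns that counts as a space between words.
    """
    if not line:
        return []
    width = len(line[0])
    # Vertical projection profile: amount of ink in each column.
    profile = [sum(row[x] for row in line) for x in range(width)]

    words = []
    start = None      # first column of the word being built
    last_ink = None   # most recent column that contained ink
    for x, ink in enumerate(profile):
        if ink == 0:
            continue
        if start is None:
            start = x
        elif x - last_ink - 1 >= min_gap:
            # Gap is wide enough: close the previous word, open a new one.
            words.append((start, last_ink + 1))
            start = x
        last_ink = x
    if start is not None:
        words.append((start, last_ink + 1))
    return words


# Tiny example: two "words" separated by four empty columns.
row = [1, 1, 0, 1, 0, 0, 0, 0, 1, 1]
print(split_line_into_words([row, row], min_gap=3))
```

Each returned range can then be used to crop one word image out of the line. Real scans would also need binarization, deskewing, and line detection first, none of which is shown here.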

Next, I developed a crowdsourcing web application using CakePHP that allows volunteers to transcribe a portion of the word images (see Figures 2 and 3).

available_pages_medium
Figure 2
transcribe_page_medium
Figure 3

Finally, I used the OpenCV libraries to match images of words by comparing their sizes, histograms, and keypoint descriptors obtained with the SURF algorithm (see Figure 4). When a new word image matched one that had already been transcribed, the text obtained from crowdsourcing was associated with the new image.

matching_border_large
Figure 4
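
The three comparisons can be read as a cascade of increasingly expensive checks. Below is a rough sketch of that idea in plain Python rather than OpenCV, so it runs without extra libraries. All names and thresholds (`tolerance`, `min_corr`, `ratio`, `min_good`) are assumptions for illustration, the descriptors are taken as already extracted (for example by SURF), and the ratio test used to filter descriptor matches is a common technique that the report may or may not have used.

```python
import math


def size_similar(size_a, size_b, tolerance=0.2):
    """True if width and height each differ by at most `tolerance` (relative)."""
    return all(abs(a - b) <= tolerance * max(a, b)
               for a, b in zip(size_a, size_b))


def histogram_correlation(h1, h2):
    """Pearson correlation between two equal-length intensity histograms."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    num = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    den = math.sqrt(sum((a - m1) ** 2 for a in h1) *
                    sum((b - m2) ** 2 for b in h2))
    return num / den if den else 0.0


def good_match_fraction(desc_a, desc_b, ratio=0.75):
    """Fraction of descriptors in desc_a that pass a nearest-neighbour ratio test."""
    if not desc_a or len(desc_b) < 2:
        return 0.0
    good = 0
    for d in desc_a:
        dists = sorted(math.dist(d, e) for e in desc_b)
        # Keep the match only if the best candidate is clearly better
        # than the second best.
        if dists[0] < ratio * dists[1]:
            good += 1
    return good / len(desc_a)


def words_match(a, b, min_corr=0.8, min_good=0.5):
    """Cheap checks first; descriptors are compared only if the others pass."""
    return (size_similar(a["size"], b["size"])
            and histogram_correlation(a["hist"], b["hist"]) >= min_corr
            and good_match_fraction(a["desc"], b["desc"]) >= min_good)


# Hypothetical data: one crowdsourced word and one untranscribed word image.
known = {"size": (40, 12), "hist": [5, 1, 0, 9],
         "desc": [(0.0, 0.0), (1.0, 1.0)], "text": "the"}
new = {"size": (42, 12), "hist": [6, 1, 0, 8],
       "desc": [(0.05, 0.0), (1.0, 0.95)]}

if words_match(known, new):
    new["text"] = known["text"]  # propagate the crowdsourced transcription
print(new.get("text"))
```

Ordering the checks this way means most non-matching pairs are rejected on size or histogram alone, before any descriptor distances are computed.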
