Things you thought you knew (but you didn't): How to read a doc or a docx file with Java

Today I learnt something.
Well, every day we learn something. Today I learnt something I thought I already knew, but actually didn't. I'm talking about reading a .doc file with Java.

A friend of mine asked me to process a collection of documents in order to build some kind of search engine over that collection. Easy and cool; when I'm done programming I'll share it with you.

One of the steps in building this application is, of course, reading the files in order to process them. I thought that reading a doc was exactly like reading a txt, but I was wrong: BufferedReader can't help here. Luckily for us there is an Apache library, POI, that is really useful for dealing with Office files.
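
Why can't BufferedReader help? A .doc is not text at all: it is a binary OLE2 container, while a .docx is a ZIP archive full of XML. Here is a minimal sketch that peeks at the first bytes of a file to tell the two apart. DocFormatSniffer is a made-up name for this example, not part of POI, and the check is only a hint: other Office formats share the same signatures (an .xls is also OLE2, an .xlsx is also a ZIP).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not a POI class): guesses the container format
// of a file from its first bytes.
final class DocFormatSniffer {

    // Signature of an OLE2 compound file, the container used by legacy .doc
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // Signature of a ZIP local file header; a .docx is a ZIP archive
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    // Returns "ole2", "zip" or "unknown" for the given leading bytes.
    static String detect(byte[] header) {
        if (startsWith(header, OLE2)) return "ole2";
        if (startsWith(header, ZIP)) return "zip";
        return "unknown";
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            // bytes are signed in Java, so mask before comparing
            if ((data[i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocFormatSniffer <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            byte[] header = new byte[8];
            int read = in.readNBytes(header, 0, header.length); // needs Java 9+
            System.out.println(detect(Arrays.copyOf(header, read)));
        }
    }
}
```

A file that reports "ole2" is a candidate for the HWPF classes, while a "zip" is most likely one of the newer XML-based formats.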

So, first of all you have to download the library from the Apache POI website.
Then you have to add its jars to your project (I'm currently working with Eclipse, so I just added them to the Build Path as external JARs).

After that, the code you need to write to read your file looks like this:


import java.io.FileInputStream;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class MyDocReader {
    public static void main(String[] args) {
        // try-with-resources closes the stream even if reading fails
        try (FileInputStream in = new FileInputStream("path_to_doc.doc")) {
            HWPFDocument doc = new HWPFDocument(in);
            WordExtractor extractor = new WordExtractor(doc);

            // this creates an array with the paragraphs from the doc
            String[] pars = extractor.getParagraphText();
            for (String par : pars) {
                if (par != null) {
                    // do what you need to do
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
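
And what about .docx? The HWPF classes above only understand the old binary format. POI handles the newer one with its XWPF classes (XWPFDocument, shipped in the poi-ooxml jar), which is probably what you want in a real project. To show what is actually inside a .docx, here is a dependency-free sketch that uses only the JDK: it opens the ZIP, finds word/document.xml and collects the text of each paragraph. DocxParagraphs is a made-up name, and this is a simplification, not what POI does internally: it ignores headers, footnotes and formatting, and paragraphs nested inside text boxes would show up twice.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not a POI class): extracts paragraph text from a .docx
final class DocxParagraphs {

    // XML namespace of WordprocessingML elements such as w:p and w:t
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    static List<String> read(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                // the main body of the document lives in this entry
                if (entry.getName().equals("word/document.xml")) {
                    return parse(zip.readAllBytes()); // needs Java 9+
                }
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx?");
    }

    private static List<String> parse(byte[] xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(xml));

        List<String> paragraphs = new ArrayList<>();
        NodeList pars = dom.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < pars.getLength(); i++) {
            // a paragraph is split into runs; the text sits in w:t elements
            NodeList texts = ((Element) pars.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder sb = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                sb.append(texts.item(j).getTextContent());
            }
            paragraphs.add(sb.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxParagraphs <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String par : read(in)) {
                System.out.println(par);
            }
        }
    }
}
```

The list it returns plays the same role as the pars array in the .doc example, so the rest of your processing can stay the same.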
 
 
 

So, that's all you need to know for reading your doc files with Java. Of course the POI library offers a lot more interesting features; someday we will look at them too!
Leave me a comment to let me know if this post was helpful or not!
Cheers!
 
