Thursday, November 26, 2009

Converting a Word document to use Unicode characters

A friend of mine recently ran into this problem: he wanted some Word documents converted to Unicode. The files are in Sinhala, but not in Sinhala Unicode. The text is stored as plain English (Latin) letters and only appears as Sinhala because of the font applied to it. What he wanted was to convert these documents to use Sinhala Unicode characters. Since he comes from a Java background, he asked me to help him do it using .NET.

Our approach was as follows. Read through the document, get all the words and check the font of each one. If the font is one that displays English letters as Sinhala glyphs (e.g. Kaputa, Thibas), replace each character with the matching Unicode character. I hope this approach is reasonable. If you can think of something smarter, please do let me know.
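The replacement step described above can be sketched as a simple lookup table. This is only a minimal sketch: the class name `LegacySinhalaMapper` and its two entries are hypothetical placeholders, not the real Kaputa or Thibas mapping, which would have to be built by hand for each font. The values are strings rather than single characters on the assumption that one legacy glyph may need more than one Unicode code point.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class LegacySinhalaMapper
{
    // Hypothetical sample entries for illustration only.
    // A real table has to be built per legacy font.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { 'a', "\u0D85" }, // placeholder target: SINHALA LETTER AYANNA
        { 'b', "\u0D9A" }, // placeholder target: SINHALA LETTER ALPAPRAANA KAYANNA
    };

    // Replaces every character that has a mapping and leaves the rest untouched.
    public static string Convert(string legacyText)
    {
        if (string.IsNullOrEmpty(legacyText))
        {
            return legacyText;
        }

        StringBuilder result = new StringBuilder(legacyText.Length);
        foreach (char c in legacyText)
        {
            string mapped;
            result.Append(Map.TryGetValue(c, out mapped) ? mapped : c.ToString());
        }
        return result.ToString();
    }
}
```

With something like this in place, the conversion handler could call `LegacySinhalaMapper.Convert(word)` instead of filling the new word with a fixed character. Leaving unmapped characters as they are keeps punctuation and digits intact.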

Here is what I did.

Declare some private members:

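// Needs a reference to Microsoft.Office.Interop.Word (using Microsoft.Office.Interop.Word;)
#region Private Members

Document wordDoc;
ApplicationClass wordApplication = new ApplicationClass();

#endregion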


In a button's click handler, open a file dialog and get the document to be converted:

DialogResult dialogResult = openFileDialog1.ShowDialog();
object o_null = System.Reflection.Missing.Value;

if (dialogResult == DialogResult.OK)
{
    // Show the chosen path and open the document only if the user confirmed the dialog
    maskedTextBox1.Text = openFileDialog1.FileName;
    object documentPath = openFileDialog1.FileName;

    wordDoc = wordApplication.Documents.Open(ref documentPath, ref o_null, ref o_null, ref o_null, ref o_null, ref o_null,
        ref o_null, ref o_null, ref o_null, ref o_null, ref o_null,
        ref o_null, ref o_null, ref o_null, ref o_null, ref o_null);
}


In another button's click handler, do the conversion:

Words words = wordDoc.Words;
IEnumerator enumerator = words.GetEnumerator();

while (enumerator.MoveNext())
{
    Range range = enumerator.Current as Range;
    // The font is what tells us whether this word needs converting (the check itself is not done here yet)
    Microsoft.Office.Interop.Word.Font font = range.Font;

    string word = range.Text.Trim();
    char[] characters = word.ToCharArray();
    char[] newCharArray = new char[characters.Length];

    for (int i = 0; i < characters.Length; i++)
    {
        // Should take the proper character from the mapping; 'S' is only a placeholder
        newCharArray[i] = 'S';
    }
    string newWord = new string(newCharArray);

    // Replace only the trimmed word so that surrounding whitespace is kept
    if (!string.IsNullOrEmpty(word))
    {
        range.Text = range.Text.Replace(word, newWord);
    }
}
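The loop above reads `range.Font` but does not yet decide anything based on it. A minimal sketch of that check could look like the following. The helper name `LegacyFontCheck` is hypothetical, the two font names are simply the ones mentioned earlier in this post, and I am assuming that the name Word reports for the font matches them exactly, which may not be the case for every installed version of these fonts.

```csharp
using System;
using System.Collections.Generic;

public static class LegacyFontCheck
{
    // Font names taken from this post; the exact names Word reports are an assumption.
    private static readonly HashSet<string> LegacyFonts =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "Kaputa", "Thibas" };

    // Returns true when the given font name is one of the known legacy Sinhala fonts.
    public static bool IsLegacySinhalaFont(string fontName)
    {
        return !string.IsNullOrEmpty(fontName) && LegacyFonts.Contains(fontName.Trim());
    }
}
```

Inside the loop, the replacement could then be wrapped in `if (LegacyFontCheck.IsLegacySinhalaFont(font.Name)) { ... }` so that text in ordinary English fonts is left alone.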

Finally, save the document as a new file in another event handler:

Object oSaveAsFile = (Object)"C:\\SampleDoc.doc";
object o_null = System.Reflection.Missing.Value;

wordDoc.SaveAs(ref oSaveAsFile, ref o_null, ref o_null, ref o_null, ref o_null, ref o_null,
ref o_null, ref o_null, ref o_null, ref o_null, ref o_null,
ref o_null, ref o_null, ref o_null, ref o_null, ref o_null);
