A few months ago I wrote about a number of problems I had importing fairly large and moderately complex HTML documents into MS Word. I’m posting this entry to update and resolve the issues that I ran into. Among them were image embedding problems and Word lockups due to runaway memory usage.
Inside of our Help and Documentation System, Help Builder, I provide the ability to export a Help Project to MS Word. Help Builder accomplishes this task by generating a single very large HTML document from the Help Topics with a special simplified template. It then uses Word Automation to import this document into Word, and load it into a custom Word document that includes a few macros for formatting.
However, there are problems with a straight import. The most serious issue is that although I was able to import an HTML document with image references, Word does not embed these images. Rather, it links the images as external links that don’t travel with the Word file. Move the file to another machine without the images and the images no longer show.
The first cut solution to this problem is to go through the document and unlink the images. Unfortunately this process (described in the last entry) was not very straightforward. The problem is that Word did not embed all images, only some of them. A number of people made some great suggestions, but none of them solved the problem of not all images getting embedded.
The original code I had was fairly complex too: it went out, deleted the image, and pasted it back in with the proper embedding functionality, which was very slow. Another major problem was that Word would eat up tons and tons of memory and choke on very large (1000+ page) documents.
So, yesterday I spent a few hours trying to figure out exactly what’s going on. Here’s what I came up with, which actually works very well now.
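' This macro replaces external image links with
' embedded images so the document is self-contained
' This macro must be run while the images are in place

' Pass 1: images that Word imported as INCLUDEPICTURE fields
x = 5
For Each oField In ActiveDocument.Fields
    If oField.Type = wdFieldIncludePicture Then
        ' Select every 6th image so the screen shows progress
        x = x + 1
        If x = 6 Then
            oField.Select
            x = 0
        End If
        oField.LinkFormat.SavePictureWithDocument = True
        oField.LinkFormat.BreakLink
        ActiveDocument.UndoClear
    End If
Next
' Pass 2: images that Word imported as Shapes
x = 2
For Each oImage In ActiveDocument.Shapes
    If Not (oImage.LinkFormat Is Nothing) Then
        x = x + 1
        If x = 3 Then
            oImage.Select
            x = 0
        End If
        oImage.LinkFormat.SavePictureWithDocument = True
        oImage.LinkFormat.BreakLink
        ActiveDocument.UndoClear
    End If
Next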
The new routine above addresses the following issues I had with my original code:
Image Embedding Problems
The problem I had previously was that Word would miss some of the images while trying to embed. I used the Fields collection, but not all images are part of this collection. Apparently the problem is that Word does not treat all images the same. Some images are treated as fields while others are treated as shapes. I’m not sure what causes this difference actually, but I think it has to do with images in the HTML having additional alignment attributes. Specifically a lot of images in the HTML/Help file would have things like HSPACE=5 or ALIGN="left" which would cause Word to import the images as Shapes as opposed to fields for just plain IMG tags.
The solution to this particular problem was to go through both the Fields and Shapes collections. So the code now loops over the Fields collection with a filter for wdFieldIncludePicture, and over the Shapes collection, which captures all images as far as I can tell.
FWIW, you can check real easy whether this worked by going to the Edit | Links menu option in Word. After you run this macro the Links option should be disabled, as all links have been unlinked and embedded!
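The same check can be done from code. The following is only a sketch and not part of the Help Builder template: CountLinkedPictures is a made-up name, and it assumes that pictures which are still linked report themselves as wdInlineShapeLinkedPicture (inline pictures) or msoLinkedPicture (floating shapes).

```vba
' Hypothetical helper: counts pictures that are still linked
' rather than embedded in the active document.
Function CountLinkedPictures() As Long
    Dim oInline As InlineShape
    Dim oShape As Shape
    Dim nLinked As Long

    ' Inline pictures (the ones backed by INCLUDEPICTURE fields)
    For Each oInline In ActiveDocument.InlineShapes
        If oInline.Type = wdInlineShapeLinkedPicture Then
            nLinked = nLinked + 1
        End If
    Next oInline

    ' Floating pictures (the ones Word imported as Shapes)
    For Each oShape In ActiveDocument.Shapes
        If oShape.Type = msoLinkedPicture Then
            nLinked = nLinked + 1
        End If
    Next oShape

    CountLinkedPictures = nLinked
End Function
```

Running `? CountLinkedPictures()` in the Immediate window before the embed macro gives a rough idea of how many images need work; afterwards the count should drop to 0 if every link was broken.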
Memory and Word Lock-up Problems – Clear the Undo Buffer!
When I first created this, it worked great on my small help files, but as soon as I threw larger files at it, Word would lock up. Lock up HARD, to the point of keeping the machine at 100% CPU where not even Task Manager would come up to let me kill Word. Eventually I got lucky and Word threw me a bone: an error message came up saying that a very large number of undo objects were eating up memory. I’ve only seen this error once, but that’s all that was needed to make this problem go away. So now after every update to the document I simply clear the Undo buffer and voilà: Word now runs within a reasonable amount of memory (which still approaches 50 megs, but that’s presumably because there’s enough memory available, and it’s a lot less than the 150+ megs I saw before).
The code from the previous entry was pretty complex and did a lot of funky things because I simply wasn’t aware of how to talk to the link manager. Originally I used the Macro recorder to figure out what to do, which essentially was cutting the image and adding it back in with some different options. Blog to the rescue: PeterB left a very useful comment that mentioned unlinking, which led me to find the LinkFormat object. This object contains all the information and operations that deal with object linking and embedding, and it simply lets me set SavePictureWithDocument and then call BreakLink() on the images.
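Stripped down to just the LinkFormat calls and the undo cleanup, the idea looks roughly like this. It is a sketch rather than the macro from the template: the Sub name is hypothetical, and the backward loop by index is an assumed precaution in case a field drops out of the Fields collection once its link is broken.

```vba
' Hypothetical sketch: embed linked picture fields and keep
' the undo history from piling up while doing so.
Sub EmbedPictureFields()
    Dim i As Long
    Dim oField As Field

    ' Walk backward so a field that disappears from the
    ' collection does not shift the items still to be visited
    For i = ActiveDocument.Fields.Count To 1 Step -1
        Set oField = ActiveDocument.Fields(i)
        If oField.Type = wdFieldIncludePicture Then
            ' Store the image data in the document itself
            oField.LinkFormat.SavePictureWithDocument = True
            ' Cut the tie to the external file
            oField.LinkFormat.BreakLink
            ' Discard the undo entries created by the two calls above
            ActiveDocument.UndoClear
        End If
    Next i
End Sub
```

Clearing the undo buffer after every image costs little; the point is that Word never gets the chance to accumulate thousands of undo entries during one long run.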
Some ‘user interface’
You may notice and wonder about the use of oImage.Select and oField.Select, which actually select each image. This is not required, but I use it for ‘making something happen’ on the screen. The image replacement is still pretty slow for my big help files and can take up to a minute. The selection makes Word scroll to each image, which makes it obvious that Word is not in fact locked up. This makes the process a little slower, but I found that it makes it a little easier to wait for the conversion to get done.
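The “select every so often” idea can be sketched on its own like this. The Sub name and the SELECT_EVERY interval are hypothetical; any interval that keeps the screen moving without slowing the run too much would do.

```vba
' Hypothetical sketch: select only every Nth picture field so the
' document scrolls as a progress indicator without paying the
' cost of a selection on every single image.
Sub ShowProgressWhileProcessing()
    Const SELECT_EVERY As Long = 6   ' assumed interval
    Dim oField As Field
    Dim nSeen As Long

    For Each oField In ActiveDocument.Fields
        If oField.Type = wdFieldIncludePicture Then
            ' nSeen is 0 for the first picture, so the first one is selected too
            If nSeen Mod SELECT_EVERY = 0 Then
                oField.Select
            End If
            nSeen = nSeen + 1
            ' ... per-image work (embedding, unlinking) would go here ...
        End If
    Next oField
End Sub
```

A larger interval means fewer screen updates and a faster run, at the price of longer stretches where nothing seems to happen.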
Calling the code from VFP via COM: Importing a Word document
So for completeness’ sake, here’s the actual COM code that performs the task of generating the HTML, importing it into Word, calling the macros, and activating Word:
CASE lcVerb == "ExportWord"
*** Generate the HTML string
lcHTML = THISFORM.PrintTopics("",.T.)
lcFile = FORCEEXT( THISFORM.oHelp.cFileName ,"HTML")
THISFORM.nStatusType = 3
THISFORM.StatusMessage("Converting output into MS Word Document",,,"Exporting")
WAIT WINDOW "" TIMEOUT .5
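LOCAL oWord as Word.Application, oDoc as Word.Document
llError = .F.
TRY
   oWord = CREATEOBJECT("Word.Application")
   oWord.VISIBLE = .F.
CATCH TO loException
   *** Assumption: the exception text is shown as the detail line
   MESSAGEBOX("Unable to load the Word COM object:" + CHR(13) + CHR(13) +;
              loException.Message)
   llError = .T.
ENDTRY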
*** Start by loading the HTML file
oDoc = oWord.Documents.OPEN(lcFile)
*** Select and copy the whole thing to the ClipBoard
*** Copy the template file
COPY FILE (THISFORM.oHelp.cProjPath + "templates\msword\helpbuildertemplate.doc") TO ;
( FORCEEXT(lcFile,"doc") )
THISFORM.StatusMessage("Updating MS Word Document",,,"Exporting")
WAIT WINDOW "" TIMEOUT .5
oDoc = oWord.Documents.OPEN(FORCEEXT(lcFile,"doc"))
oDoc.SAVEAS(FORCEEXT( lcFile, "doc" ))
*** Now replace the header text
oWord.ActiveWindow.ActivePane.VIEW.SeekView= 9 && wdSeekCurrentPageHeader
oWord.ActiveWindow.ActivePane.VIEW.SeekView= 0 && wdSeekMainDocument
*** Make visible in case macro restrictions are in place
oWord.VISIBLE = .T.
THISFORM.StatusMessage("Embedding images into MS Word Document",,,"Exporting")
*** Force Images to be embedded into the document
*** Force Help Builder to the top
THISFORM.AlwaysOnTop = .t.
THISFORM.AlwaysOnTop = .f.
IF MESSAGEBOX("Export to Microsoft Word complete." + CHR(13) + ;
"Do you want to add a Table of Contents?",64+4,WWHELP_APPNAME)=6
THISFORM.StatusMessage("Adding Table of Contents",,,"Exporting")
WAIT WINDOW "" TIMEOUT .1
*** No Op - just fail - Word displays an error message
*** Go top in Word
THISFORM.StatusMessage("MS Word Conversion completed...")
*** Bring Word to the top
The code basically creates a new document that imports the HTML, then copies the HTML and pastes it into a pre-created document template. The template contains several macros that then get run via Automation calls to Word.Application.Run().
The result of all of this is a process that works really well even with very large documents. Word still hogs memory and CPU, but it no longer locks up. Images are embedded properly and life is good!