Rick Strahl's Weblog  


Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)



A few months ago I wrote about a number of problems I had with importing fairly large and moderately complex HTML documents into MS Word. I’m posting this entry to update and resolve the issues that I ran into. Among them were image embedding problems and Word lockups due to runaway memory usage.

 

Inside of our Help and Documentation System, Help Builder, I provide the ability to export a Help Project to MS Word. Help Builder accomplishes this task by generating a single very large HTML document from the Help Topics with a special simplified template. It then uses Word Automation to import this document into Word, and load it into a custom Word document that includes a few macros for formatting.

 

However, there are problems with a straight import. The most serious issue is that although I was able to import an HTML document with image references, Word does not embed these images. Rather, it links the images as external links that don’t travel with the Word file. Move the file to another machine without the images and the images won’t show any longer.

 

The first-cut solution to this problem is to go through the document and unlink the images. Unfortunately this process (described in the last entry) was not very straightforward. The problem is that Word did not embed all images, only some of them. A number of people made some great suggestions, but none of them solved the problem of not all images getting embedded.

 

The original code I had was fairly complex too: it went out, deleted the image and pasted it back in with the proper embedding functionality, which was very slow. Another major problem was the fact that Word would eat up tons and tons of memory and choke on very large (1000+ page) documents.

 

So, yesterday I spent a few hours trying to figure out exactly what’s going on. Here’s what I came up with, which actually works very well now.

 

 

'***************************************************************************
'*** EmbedImages
'*****************
' This macro replaces external image links with
' embedded images so the document is self-contained
' This macro must be run while the images are in place
Sub EmbedImages()
    ' Pass 1: images that Word imported as INCLUDEPICTURE fields
    x = 5   ' primed so the very first field gets selected
    For Each oField In ActiveDocument.Fields
        If oField.Type = wdFieldIncludePicture Then
            x = x + 1
            If x = 6 Then
                ' Select every sixth field so the screen visibly moves
                oField.Select
                x = 0
                DoEvents
            End If
            oField.LinkFormat.SavePictureWithDocument = True
            oField.LinkFormat.BreakLink
            ' Clear the undo buffer after each change to keep memory in check
            ActiveDocument.UndoClear
        End If
    Next

    ' Pass 2: images that Word imported as Shapes
    x = 2   ' primed so the very first shape gets selected
    For Each oImage In ActiveDocument.Shapes
        If Not (oImage.LinkFormat Is Nothing) Then
            x = x + 1
            If x = 3 Then
                ' Select every third shape so the screen visibly moves
                oImage.Select
                x = 0
                DoEvents
            End If
            oImage.LinkFormat.SavePictureWithDocument = True
            oImage.LinkFormat.BreakLink
            ActiveDocument.UndoClear
        End If
    Next
End Sub

 

The new routine above addresses the following issues I had with my original code:

 

Image Embedding Problems

The problem I had previously was that Word would miss some of the images while trying to embed. I used the Fields collection, but not all images are part of this collection. Apparently the problem is that Word does not treat all images the same. Some images are treated as fields while others are treated as shapes. I’m not sure what causes this difference actually, but I think it has to do with images in the HTML having additional alignment attributes. Specifically a lot of images in the HTML/Help file would have things like HSPACE=5 or ALIGN="left"  which would cause Word to import the images as Shapes as opposed to fields for just plain IMG tags.
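To get a rough idea ahead of time of which images might land in which collection, the generated HTML can be scanned for IMG tags that carry layout attributes. What follows is only a minimal Python sketch and not part of Help Builder: the names classify_images and LAYOUT_ATTRS are made up for illustration, the attribute list comes straight from the two examples above (ALIGN and HSPACE), and whether Word really splits images along these lines is just the guess described in this section.

```python
from html.parser import HTMLParser

# Assumption: only the two attributes named in the post. This is not a
# documented Word rule, just the hypothesis from the text above.
LAYOUT_ATTRS = {"align", "hspace"}


class ImgScanner(HTMLParser):
    """Collects the src of every IMG tag, split by layout attributes."""

    def __init__(self):
        super().__init__()
        self.plain = []    # IMG tags with no layout attributes
        self.floated = []  # IMG tags with ALIGN or HSPACE

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "img":
            return
        attr_map = dict(attrs)
        src = attr_map.get("src", "")
        if attr_map.keys() & LAYOUT_ATTRS:
            self.floated.append(src)
        else:
            self.plain.append(src)


def classify_images(html):
    """Return (plain, floated) lists of image sources found in html."""
    scanner = ImgScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.plain, scanner.floated


if __name__ == "__main__":
    sample = '<p><IMG SRC="a.png"><img src="b.png" ALIGN="left"></p>'
    plain, floated = classify_images(sample)
    print("plain IMG tags:", plain)
    print("IMG tags with layout attributes:", floated)
```

If the guess holds, the second list would be the images to look for in the Shapes collection, and the first list the ones to look for in Fields.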

 

The solution to this particular problem was to go through both the Fields and Shapes collections. So the code now loops over the Fields collection with a filter for wdFieldIncludePicture and then over the Shapes collection, which captures all images as far as I can tell.

 

FWIW, you can easily check whether this worked by going to the Edit | Links menu option in Word. After you run this macro the Links option should be disabled, as all links have been unlinked and embedded!

 

Memory and Word Lock-up Problems – Clear the Undo Buffer!

When I created this at first it worked great on my small help files, but as soon as I threw larger files at it, Word would lock up. Lock up HARD, to the point of keeping the machine at 100% CPU where not even Task Manager would come up to let me kill Word. Eventually I got lucky and Word threw me a bone: an error message came up saying that there was a very large number of undo objects which were eating up memory. I’ve only seen this error once, but that’s all that was needed to make this problem go away. So now after every update to the document I simply clear the Undo buffer and voila - Word now runs within a reasonable amount of memory (which is still approaching 50 megs, but that’s presumably because there’s enough memory available, and it is a lot less than the 150+ megs I saw before).

 

Simpler Code

The code from the previous entry was pretty complex and did a lot of funky things because I was simply not aware of how to talk to the link manager. Originally I used the Macro recorder to figure out what to do, which essentially was cutting the image and adding it back in with some different options. Blog to the rescue: PeterB left a very useful comment that mentioned unlinking, which led me to find the LinkFormat object. This object contains all information and operations that deal with object linking and embedding, and it simply lets me set SavePictureWithDocument and then call BreakLink() on the images.

 

Some ‘user interface’

You may notice and wonder about the use of oImage.Select and oField.Select, which select an image every so often (every sixth field and every third shape in the code above). This is not required, but I use this for ‘making something happen’ on the screen. The image replacement is still pretty slow for my big help files and can take up to a minute. The selection makes Word scroll to the selected image, which makes it obvious that Word is not in fact locked up. This makes the process a little slower, but I found that it makes it a little easier to wait for the conversion to get done.
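The counter behind those Select calls is easy to misread (x starts at 5 and fires at 6 for fields, starts at 2 and fires at 3 for shapes). Purely as an illustration, here is the same counting logic in Python. The helper name progress_indices is hypothetical and nothing here talks to Word; it only shows which items in a run would get selected.

```python
def progress_indices(total, every):
    """Return the 0-based positions at which the macro-style counter fires.

    Mirrors the VBA pattern: the counter is primed to every - 1 so the
    first item fires immediately, then resets to 0 after each hit.
    """
    fired = []
    counter = every - 1
    for index in range(total):
        counter += 1
        if counter == every:
            fired.append(index)  # this is where the macro calls .Select
            counter = 0
    return fired


if __name__ == "__main__":
    # Fields loop: x = 5, fires when x = 6
    print(progress_indices(14, 6))
    # Shapes loop: x = 2, fires when x = 3
    print(progress_indices(7, 3))
```

So with a step of 6 the first, seventh and thirteenth picture fields get selected, which is enough screen movement to show that Word is alive without paying for a selection on every image.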

 

Calling the code from VFP via COM: Importing a Word document

So for completeness' sake, here’s the actual COM code that performs the task of generating the HTML, importing it into Word, calling the macros and activating Word:

 

   CASE lcVerb == "ExportWord"

      *** Generate the HTML string
      lcHTML = THISFORM.PrintTopics("",.T.)

      lcFile = FORCEEXT( THISFORM.oHelp.cFileName ,"HTML")
      STRTOFILE(lcHTML,lcFile)

      THISFORM.nStatusType = 3
      THISFORM.StatusMessage("Converting output into MS Word Document",,,"Exporting")
      DOEVENTS
      WAIT WINDOW "" TIMEOUT .5

      LOCAL oWord as Word.Application,  oDoc as Word.Document
      llError = .F.
      TRY
         oWord=CREATE("word.application")
         oWord.VISIBLE = .F.

         DoEvents
      CATCH
         MESSAGEBOX("Unable to load the Word COM object:" + CHR(13) + CHR(13) +;
                    MESSAGE())
         llError = .T.
      ENDTRY
      IF llError
        RETURN
      ENDIF

      *** Start by loading the HTML file
      oDoc = oWord.Documents.OPEN(lcFile)

      DOEVENTS

      *** Select and copy the whole thing to the ClipBoard
      oWord.SELECTION.WholeStory
      oWord.SELECTION.COPY
      oDoc.CLOSE()

      *** Copy the template file
      COPY FILE (THISFORM.oHelp.cProjPath + "templates\msword\helpbuildertemplate.doc") TO ;
                ( FORCEEXT(lcFile,"doc") )

      THISFORM.StatusMessage("Updating MS Word Document",,,"Exporting")
      DOEVENTS
      WAIT WINDOW "" TIMEOUT .5

      oDoc = oWord.Documents.OPEN(FORCEEXT(lcFile,"doc"))
      oWord.SELECTION.Paste()

      DOEVENTS

      oDoc.SAVEAS(FORCEEXT( lcFile, "doc" ))

      *** Now replace the header text
      oWord.ActiveWindow.ActivePane.VIEW.SeekView= 9  && wdSeekCurrentPageHeader
      oWord.SELECTION.FIND.Execute("##Project Name##",.F.,.F.,.F.,.F.,.F.,.T.,1,.F.,lcProjName,2)
      oWord.ActiveWindow.ActivePane.VIEW.SeekView= 0  && wdSeekMainDocument

      *** Make visible in case macro restrictions are in place
      oWord.VISIBLE = .T.

      THISFORM.StatusMessage("Embedding images into MS Word Document",,,"Exporting")
      DOEVENTS

      *** Force Images to be embedded into the document
      TRY
        oWord.APPLICATION.RUN("EmbedImages")
        DOEVENTS

        *** Force Help Builder to the top
        THISFORM.AlwaysOnTop = .t.
        DOEVENTS
        THISFORM.AlwaysOnTop = .f.

        IF MESSAGEBOX("Export to Microsoft Word complete." + CHR(13) + ;
                   "Do you want to add a Table of Contents?",64+4,WWHELP_APPNAME)=6
           THISFORM.StatusMessage("Adding Table of Contents",,,"Exporting")
           DOEVENTS
           WAIT WINDOW "" TIMEOUT .1
           THISFORM.Refresh()
           oWord.Application.Run([AddTableOfContents])
        ENDIF
      CATCH
         *** No Op - just fail - Word displays an error message
      ENDTRY

      *** Go top in Word
      oWord.Selection.GoTo(1,1)
      oDoc.Save()

      DOEVENTS

      THISFORM.StatusMessage("MS Word Conversion completed...")

      *** Bring Word to the top
      oWord.Activate()

 

The code basically opens the generated HTML file as a new Word document, then copies its contents and pastes them into a pre-created document template. The template contains several macros that then get run via Automation calls to Word.Application.Run().

 

The result of all of this is a process that works really well even with very large documents. Word still hogs memory and CPU, but it no longer locks up. Images are embedded properly and life is good!


The Voices of Reason


 

Bela Bihari
January 26, 2005

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

using your code I've implemented without Macros the same from vbs:

Dim oWord
Dim oDoc
Dim mFile
Dim x
Dim oField
Dim oImage

mFile = "c:\myFile.doc"
x = 5

Set oWord = CreateObject("Word.Application")
oWord.VISIBLE = 0
Set oDoc = oWord.Documents.OPEN(mFile)
oDoc.SaveAs mFile,0

For Each oField In oDoc.Fields
    If oField.Type = 67 Then
        x = x + 1
        If x = 6 Then
            oField.Select
            x = 0
        End If
        oField.LinkFormat.SavePictureWithDocument = True
        oField.LinkFormat.BreakLink
        oDoc.UndoClear
    End If
Next

x = 2
For Each oImage In oDoc.Shapes
    If Not (oImage.LinkFormat Is Nothing) Then
        x = x + 1
        If x = 3 Then
            oImage.Select
            x = 0
        End If
        oImage.LinkFormat.SavePictureWithDocument = True
        oImage.LinkFormat.BreakLink
        oDoc.UndoClear
    End If
Next

oDoc.Save()
oDoc.CLOSE()
oWord.Quit()
Set oWord = NOTHING

Ibraheem
April 06, 2005

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

First of thanks for the great help, regarding word automation with embeded img's but still tht didn't solve mine prob, as i am still getting both collections of shapes and fields are empty in automating a doc having an .jpg image embeded....can you plz reply me how can i get the embeded images from the word document
i am a cs student doing final proj and struck here......soo i'm anxiously waiting for the reply...
regards
Ibraheem
Ibraheempindi@hotmail.com

j_lalith
September 05, 2005

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

This is a great work but can somebody let me know what is meant by

If x = 6 Then
oField.Select

and
If x = 3 Then
oImage.Select

Daniele
September 10, 2005

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

Rick,

your comment at the top ' This macro must be run while the images are in place' means that if Word hasn't yet loaded all linked images you'll miss some images within the saved doc, as it is happening to me.

Adding some wait after opening the html file and before unlinking images is not a solution I like very much.

Do you know of any method/property of Word objects that can be used to check if all images are fully loaded ?

Tks
Daniele

Toby Henderson
September 20, 2005

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

Rick,

c# version for completeness

public void EmbedImages()
{
    // Unlink Images
    for (int j = _wordApplication.Application.ActiveDocument.Fields.Count; j > 0; j--)
    {
        if (_wordApplication.Application.ActiveDocument.Fields[j].Type == WdFieldType.wdFieldIncludePicture)
        {
            _wordApplication.Application.ActiveDocument.Fields[j].Update();
            _wordApplication.Application.ActiveDocument.Fields[j].Unlink();
            //stop undo buffer filling up, stops memory usage going mental.
            _wordApplication.Application.ActiveDocument.UndoClear();
        }
    }

    // Unlink Shapes
    for (int i = _wordApplication.Application.ActiveDocument.Shapes.Count; i > 0; i--)
    {
        object shape = i;
        if (_wordApplication.Application.ActiveDocument.Shapes.get_Item(ref shape).LinkFormat != null)
        {
            _wordApplication.Application.ActiveDocument.Shapes.get_Item(ref shape).LinkFormat.SavePictureWithDocument = true;
            _wordApplication.Application.ActiveDocument.Shapes.get_Item(ref shape).LinkFormat.BreakLink();
            //stop undo buffer filling up, stops memory usage going mental.
            _wordApplication.Application.ActiveDocument.UndoClear();
        }
    }
}


Thanks for the original code, really helped me out of a tight spot.

l8r
Toby

marcos
October 28, 2005

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

j_lalith,
The oField.Select and oImage.Select calls change the selected image in the UI. So if you watch the doc, then every so often the next image is selected. He does not want to waste time highlighting each one, just enough to let the user know that something is happening.

If you dont care about that, then you can leave it out.

abcabc
March 06, 2006

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

Hi, I have a HTML page, properly formatted in IE, and I save that document to <filename>.doc It seems like Word is ignoring the position:absolute attribute. Can you give me a solution to this?


Rick Strahl
October 05, 2006

# Image Problems when Importing HTML into Microsoft Word via Automation - Rick Strahl

Some observations in creating Word Automation document from an HTML file and embedding the images into the Word document rather than leaving them as external links.


John
August 05, 2009

# re: Importing an HTML document into Word via COM Automation and dealing with Image Embedding (revisited)

Darn, This stopped working in the last 30days for me on Office 2007. It had been working for the last year. now it no longer works.


Dim oWord
Dim oDoc
Dim mFile
Dim x
Dim oField
Dim oImage



Set objDialog = CreateObject("UserAccounts.CommonDialog")
objDialog.Filter = "VBScript Scripts|*.doc|All Files|*.*"
objDialog.Flags = &H0200
objDialog.FilterIndex = 1
objDialog.InitialDir = "C:\sar-july"
intResult = objDialog.ShowOpen

If intResult = 0 Then
Wscript.Quit
Else
Wscript.Echo objDialog.FileName
End If



mFile = "C:\sar-july\project.doc"
x = 5

Set oWord = CreateObject("Word.Application")
oWord.VISIBLE = 0
Set oDoc = oWord.Documents.OPEN(objDialog.FileName)
oDoc.SaveAs objDialog.FileName,0

For Each oField In oDoc.Fields
    If oField.Type = 67 Then
        x = x + 1
        If x = 6 Then
            oField.Select
            x = 0
        End If
        oField.LinkFormat.SavePictureWithDocument = True
        oField.LinkFormat.BreakLink
        oDoc.UndoClear
    End If
Next

x = 2
For Each oImage In oDoc.Shapes
    If Not (oImage.LinkFormat Is Nothing) Then
        x = x + 1
        If x = 3 Then
            oImage.Select
            x = 0
        End If
        oImage.LinkFormat.SavePictureWithDocument = True
        oImage.LinkFormat.BreakLink
        oDoc.UndoClear
    End If
Next

oDoc.Save()
oDoc.CLOSE()
oWord.Quit()
Set oWord = NOTHING


West Wind  © Rick Strahl, West Wind Technologies, 2005 - 2024