From: tavares at connix.com (tavares at connix.com)
Date: Wed, 28 Apr 1999 20:54:13 GMT
Subject: HTML "sanitizer" in Python
References: <s72703fc.021@holnam.com>
Message-ID: <7g7shk$rgd$1@nnrp1.dejanews.com>
Content-Length: 2997
X-UID: 1342

In article <s72703fc.021 at holnam.com>,
"Scott Stirling" <SSTirlin at holnam.com> wrote:
> Hi,
>
> I am new to Python. I have an idea of a work-related project I want to do,
> and I was hoping some folks on this list might be able to help me realize it.
> I have Mark Lutz' _Programming Python_ book, and that has been a helpful
> orientation. I like his basic packer and unpacker scripts, but what I want
> to do is something in between that basic program and its later, more complex
> manifestations.
>
> I am on a Y2K project with 14 manufacturing plants, each of which has an
> inventory of plant process components that need to be tested and/or
> replaced. I want to put each plant's current inventory on the corporate
> intranet on a weekly or biweekly basis. All the plant data is in an Access
> database. We are querying the data we need and importing into 14 MS Excel 97
> spreadsheets. Then we are saving the Excel sheets as HTML. The HTML files
> bloat out with a near 100% increase in file size over the original Excel
> files. This is because the HTML converter in Excel adds all kinds of
> unnecessary HTML code, such as <FONT FACE="Times New Roman"> for every
> single cell in the table. Many of these tables have over 1000 cells, and
> this code, along with its accompanying closing FONT tag, add up quick.
> The other main, unnecessary code is the ALIGN="left" attribute in <TD>
> tags (the default alignment _is_ left). The unnecessary tags are
> consistent and easy to identify, and a routine should be writable that
> will automate the removal of them.
>
> I created a Macro in Visual SlickEdit that automatically opens all these
> HTML files, finds and deletes all the tags that can be deleted, saves the
> changes and closes them. I originally wanted to do this in Python, and I
> would still like to know how, but time constraints prevented it at the
> time. Now I want to work on how to create a Python program that will do
> this. Can anyone help? Has anyone written anything like this in Python
> already that they can point me too? I would really appreciate it.
>

Well, it wouldn't be that hard in Python to parse the HTML files and reformat
them in various ways. You can either go the route of straight text
substitution using regular expressions, or you could use htmllib to actually
parse the HTML files into a data structure, and then write them back out
again.

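To make the regular-expression route concrete, here is a minimal sketch
written for current Python 3. The function name strip_excel_bloat and the
sample string are made up for illustration, and the patterns only assume
the two kinds of bloat described in the quoted message (FONT tags and
ALIGN="left" on TD tags); real Excel output may well need more cases.

```python
import re

# Opening or closing FONT tags, whatever attributes they carry.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)
# A complete opening TD tag, so attributes are only edited inside TD tags.
TD_TAG = re.compile(r"<td\b[^>]*>", re.IGNORECASE)
# An ALIGN=left attribute (quotes optional) plus the whitespace before it.
ALIGN_LEFT = re.compile(r"""\s+align\s*=\s*(["']?)left\1(?=[\s>])""", re.IGNORECASE)


def strip_excel_bloat(html):
    """Return html without FONT tags and without ALIGN=left on TD tags."""
    html = FONT_TAG.sub("", html)
    # Rewrite each TD tag in place, dropping only the redundant attribute.
    return TD_TAG.sub(lambda m: ALIGN_LEFT.sub("", m.group(0)), html)


# Hypothetical one-row sample in the style the quoted message describes.
sample = '<TR><TD ALIGN="left"><FONT FACE="Times New Roman">Pump 7</FONT></TD></TR>'
print(strip_excel_bloat(sample))
```

Running the same function over each of the 14 files is then a matter of
reading the file, calling strip_excel_bloat on its contents, and writing the
result back out. Regular expressions are fine here because the unwanted
markup is so uniform; for messier HTML a real parser is the safer choice.
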
However, may I suggest a different method?

You've got your original data in Access. There are several different ways to
talk to Access from Python. You could pull your data directly from Access
using Python and skip Excel altogether. And Python's got some great modules
for generating HTML. Heck, add CGI or Zope to the mix and you could generate
your inventory lists at the web server on the fly!

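As a rough sketch of the "generate the HTML yourself" idea, the snippet below
builds a bare table from rows of data using only the standard library of
current Python 3. The function render_table, the column names, and the
sample rows are all hypothetical; in practice the rows would come from a
query against the Access database (over ODBC, for example), which is not
shown here.

```python
from html import escape


def render_table(title, columns, rows):
    """Return a minimal HTML page holding one table, with no per-cell markup."""
    out = [
        "<html><head><title>%s</title></head><body>" % escape(title),
        "<h1>%s</h1>" % escape(title),
        '<table border="1">',
        "<tr>" + "".join("<th>%s</th>" % escape(c) for c in columns) + "</tr>",
    ]
    for row in rows:
        # Escape every value so stray <, > or & in the data cannot break the page.
        cells = "".join("<td>%s</td>" % escape(str(v)) for v in row)
        out.append("<tr>" + cells + "</tr>")
    out.append("</table></body></html>")
    return "\n".join(out)


# Hypothetical sample data standing in for a query result from Access.
columns = ["Component", "Location", "Y2K status"]
rows = [
    ("PLC controller", "Kiln 2", "Tested"),
    ("Scale <A>", "Loading dock", "Needs replacement"),
]
page = render_table("Plant 1 inventory", columns, rows)
print(page)
```

Since the page is built from the data directly, there is nothing to clean up
afterwards: no FONT tags and no redundant ALIGN attributes ever get written.
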
Ok, I'll calm down now.

-Chris

-----------== Posted via Deja News, The Discussion Network ==----------
http://www.dejanews.com/ Search, Read, Discuss, or Start Your Own