From: mnot at pobox.com (Mark Nottingham)
Date: Wed, 28 Apr 1999 22:17:08 GMT
Subject: HTML "sanitizer" in Python
References: <s72703fc.021@holnam.com> <19990428152042.A708@better.net>
Message-ID: <00e501be91c4$db944f20$0301a8c0@cbd.net.au>
Content-Length: 5028
X-UID: 452

There's a better (albeit non-Python) way.

Check out http://www.w3.org/People/Raggett/tidy/

Tidy will do wonderful things in terms of making HTML compliant with the
spec (closing tags, cleaning up the crud that Word makes, etc.). As a big
bonus, it will remove all <FONT> tags, etc., and replace them with CSS1
style sheets. Wow.

It's written in C, and is also available with a Windows GUI (HTML-Kit)
that makes a pretty good HTML editor as well. On Unix, it's a command-line
utility, so you can use it (clumsily) from a Python program.
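
For what it's worth, driving the command-line tool from Python might look
something like the sketch below. It assumes a "tidy" executable is on the
PATH and that it accepts the -q (quiet) and -clean (replace FONT and similar
presentational tags with CSS) options. The function names are made up for
illustration, and I have not run this against the Excel output.

```python
import shutil
import subprocess


def tidy_command(executable="tidy", clean=True):
    """Build the argument list for a quiet Tidy run that reads stdin."""
    args = [executable, "-q"]      # -q: suppress non-essential output
    if clean:
        args.append("-clean")      # -clean: replace FONT etc. with CSS
    return args


def run_tidy(html, executable="tidy"):
    """Pipe html through Tidy and return the cleaned markup.

    Raises RuntimeError if the executable is missing or Tidy reports errors.
    """
    if shutil.which(executable) is None:
        raise RuntimeError("%s not found on PATH" % executable)
    result = subprocess.run(
        tidy_command(executable),
        input=html,
        capture_output=True,
        text=True,
    )
    # Tidy exits with 1 for warnings and 2 for errors, so only a status
    # above 1 is treated as a failure here.
    if result.returncode > 1:
        raise RuntimeError(result.stderr)
    return result.stdout
```

Keeping the argument list in its own function makes it easy to check or
adjust the options without having Tidy installed.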

I suppose an extension could also be written; I will look into this (or if
anyone does it, please tell me!).

----- Original Message -----
From: William Park <parkw at better.net>
Newsgroups: comp.lang.python
To: <python-list at cwi.nl>
Sent: Thursday, April 29, 1999 5:20
Subject: Re: HTML "sanitizer" in Python

> On Wed, Apr 28, 1999 at 12:49:55PM -0400, Scott Stirling wrote:
> > Hi,
> >
> > I am new to Python. I have an idea of a work-related project I want
> > to do, and I was hoping some folks on this list might be able to
> > help me realize it. I have Mark Lutz' _Programming Python_ book,
> > and that has been a helpful orientation. I like his basic packer
> > and unpacker scripts, but what I want to do is something in between
> > that basic program and its later, more complex manifestations.
> >
> > I am on a Y2K project with 14 manufacturing plants, each of which
> > has an inventory of plant process components that need to be tested
> > and/or replaced. I want to put each plant's current inventory on
> > the corporate intranet on a weekly or biweekly basis. All the plant
> > data is in an Access database. We are querying the data we need and
> > importing it into 14 MS Excel 97 spreadsheets. Then we are saving the
> > Excel sheets as HTML. The HTML files bloat out with a near 100%
> > increase in file size over the original Excel files. This is
> > because the HTML converter in Excel adds all kinds of unnecessary
> > HTML code, such as <FONT FACE="Times New Roman"> for every single
> > cell in the table. Many of these tables have over 1000 cells, and
> > this code, along with its accompanying closing FONT tag, adds up
> > quickly. The other main, unnecessary code is the ALIGN="left"
> > attribute in <TD> tags (the default alignment _is_ left). The
> > unnecessary tags are consistent and easy to identify, and a routine
> > should be writable that will automate the removal of them.
> >
> > I created a macro in Visual SlickEdit that automatically opens all
> > these HTML files, finds and deletes all the tags that can be
> > deleted, saves the changes and closes them. I originally wanted to
> > do this in Python, and I would still like to know how, but time
> > constraints prevented it at the time. Now I want to work on how to
> > create a Python program that will do this. Can anyone help? Has
> > anyone written anything like this in Python already that they can
> > point me to? I would really appreciate it.
> >
> > Again, the main flow of the program is:
> >
> > >> Open 14 HTML files, all in the same folder and all with the .html
> > >> extension. Find certain character strings and delete them from
> > >> the files. In one case (the <TD> tags) it is easier to find the
> > >> whole tag with attributes and then _replace_ the original tag
> > >> with a plain <TD>. Save the files. Close the files. Exit the
> > >> program.
>
> Hi Scott,
>
> I shall assume that each <TD ...> tag occurs on one line. Try 'sed',
>     for i in *.html
>     do sed -e 's/<TD ALIGN="left">/<TD>/g' "$i" > "/tmp/$i" && mv "/tmp/$i" "$i"
>     done
> or, in Python,
>     for s in open('...', 'r').readlines():
>         # replace() takes the old text first, then the new text
>         s = s.replace('<TD ALIGN="left">', '<TD>')
>         print(s.strip())
>
> If a <TD ...> tag spans more than one line, then read the whole file
> into one string instead, like
>     s = open('...', 'r').read()
>
> If the tag is not consistent, then you may have to use a regular
> expression with the 're' module.
>
> Hope this helps.
> William
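
To make the regular-expression route a bit more concrete, here is a rough
sketch that strips the FONT tags and the ALIGN="left" attribute from <TD>
tags, even when a tag carries other attributes or wraps across lines. The
patterns and the function name are my own guesses at what the Excel output
looks like, so treat it as a starting point and not a tested solution.

```python
import re

# Opening or closing FONT tags, with any attributes, in any letter case.
FONT_TAG = re.compile(r"</?FONT\b[^>]*>", re.IGNORECASE)

# An ALIGN="left" attribute inside a <TD ...> tag. [^>]* and \s also match
# newlines, so tags that wrap across lines are handled too.
TD_ALIGN_LEFT = re.compile(
    r'(<TD\b[^>]*?)\s+ALIGN\s*=\s*"?left"?(?=[\s>])', re.IGNORECASE
)


def strip_redundant_markup(html):
    """Remove FONT tags and default ALIGN="left" attributes from TD tags."""
    html = FONT_TAG.sub("", html)
    # Keep everything captured before the attribute, drop the attribute.
    return TD_ALIGN_LEFT.sub(r"\1", html)
```

Given '<TD ALIGN="left"><FONT FACE="Times New Roman">42</FONT></TD>', this
returns '<TD>42</TD>'. Other alignments, such as ALIGN="right", are left
alone.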
>
> >
> > More advanced options would be the ability for the user to set
> > parameters for the program upon running it, to keep from hard-coding
> > the find and replace parms.
>
> To use command-line parameters, like
>     $ cleantd 'ALIGN="left"'
> change the replacement line to
>     s = s.replace('<TD %s>' % sys.argv[1], '<TD>')  # needs "import sys"
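
Putting the command-line idea together with the "open every .html file in a
folder" flow, something along these lines might do. The script name cleantd
comes from the example above; the optional folder argument, the helper names
and the choice to rewrite files in place are my assumptions.

```python
from pathlib import Path


def clean_td(html, attribute):
    """Replace every '<TD attribute>' with a plain '<TD>'."""
    return html.replace("<TD %s>" % attribute, "<TD>")


def clean_folder(folder, attribute):
    """Rewrite each .html file in folder in place; return the names changed."""
    changed = []
    for path in sorted(Path(folder).glob("*.html")):
        original = path.read_text()
        cleaned = clean_td(original, attribute)
        if cleaned != original:
            path.write_text(cleaned)
            changed.append(path.name)
    return changed


def main(argv):
    """argv[1] is the attribute text; argv[2] is an optional folder."""
    if len(argv) < 2:
        print("usage: cleantd ATTRIBUTE [FOLDER]")
        return 1
    folder = argv[2] if len(argv) > 2 else "."
    for name in clean_folder(folder, argv[1]):
        print("cleaned", name)
    return 0
```

Ending the file with sys.exit(main(sys.argv)), after an "import sys", would
turn it into a script. Since it overwrites the files, try it on a copy of
the folder first.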
>
> >
> > OK, thanks for any help you can provide. I partly was turned on to
> > Python by Eric Raymond's article, "How to Become a Hacker" (featured
> > on /.). I use Linux at home, but this program would be for use on a
> > Windows 95 platform at work, if that makes any difference. I do
> > have the latest Python interpreter and editor for Windows here at
> > work.
> >
> > Yours truly,
> > Scott
> >
> > Scott M. Stirling
> > Visit the HOLNAM Year 2000 Web Site: http://web/y2k
> > Keane - Holnam Year 2000 Project
> > Office: 734/529-2411 ext. 2327 fax: 734/529-5066 email: sstirlin at holnam.com
> >
> > --
> > http://www.python.org/mailman/listinfo/python-list