Extracting forms from a web page

Having just reached the first stage of developing a module for parsing HTML, I thought I'd share a wee snippet for gathering forms from a page to illustrate how the module can be used. The following function:

  • Loops through the page content copying everything stored between <form>...</form> tags
  • Loops through each form picking out the pertinent information from <input> and <textarea> tags

I'll leave improving on this as an exercise:

import <markup>

gather-forms: func [
	doc [string!]
	/local forms mark extent attributes
][
	doc: load-markup doc
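	; First pass: copy everything between each <form> and its </form>
	forms: collect [
		parse doc [
			any [
				<form> mark: some [
					; another <form> before the closing tag: warn and give up on this form
					and <form> (print "Warning: cannot gather nested forms") break
					|
					extent: </form> (keep/only copy/part mark extent) break
					|
					; unclosed form: keep everything up to the end of the document
					end (keep/only copy/part mark tail mark)
					|
					skip
				]
				|
				skip
			]
		]
	]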

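	; Second pass: build one object per form from its attributes and fields
	collect [
		foreach form forms [
			; the first value in a gathered form should be the <form> tag's attribute map
			either map? pick new-line/all form true 1 [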
				keep make object! [
					action: select form/1 "action"
					method: any [select form/1 "method" "GET"]
					fields: make map! collect [
						parse form [
							any [
								<input> set attributes map! (
									keep lock any [select attributes "name" "anon"]
									keep select attributes "value"
								)
								|
								<textarea> set attributes map! set text string! (
									keep lock any [select attributes "name" "anon"]
									keep text
								)
								|
								skip
							]
						]
					]
				]
			][
				print "Warning: form has no attributes"
			]
		]
	]
]

probe gather-forms to string! read https://forum.rebol.info

I had an r3/view script which would parse out all the forms on a page and let you enter values to see if you could post to it.

Any reason why you use func instead of function, which collects locals automatically?

Well—Red has a GUI, wouldn't be a leap to make that work!

However, even just in a script the above is close to that:

the-form: first gather-forms some-web-page
; ... change some form fields ...
write to url! the-form/action [
    the-form/method
    to-webform to block! the-form/fields
]

Again, resolving relative 'action URLs is left as an exercise.

Note to self: update <webform> to allow map! arguments.

How about including the gather-forms function in the module?

  1. Habit.
  2. Get dinged when I use words I hadn't accounted for.
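
For anyone unsure what the difference amounts to, here is a minimal sketch. It assumes Rebol 3 or Red semantics, where function collects the set-words in its body as locals; other builds may behave differently. The names counter, with-func and with-function are made up for the illustration and are not part of the module.

```rebol
counter: 0

; func only localizes words listed after /local,
; so this set-word reaches the outer counter
with-func: func [n] [
    counter: n
]

; function collects set-words in its body as locals,
; so the outer counter is left alone
with-function: function [n] [
    counter: n
]

with-func 5
with-function 9
print counter  ; the outer counter was last set by with-func
```

With func, a word you forgot to list leaks out of the function; with function, a word you meant to set outside gets captured silently. Which surprise you would rather have is largely a matter of taste.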

'Cause it's a quickie and really could use some work. Just thought I'd share it, as it's a common request and I figure it's a good starting point if anyone wants to run with it now.

I've something better in mind anyhow.

BTW: should have a Cookbook/Reviews category in the forum.