This is an abridged (spam and irrelevant messages removed) version of a thread
at the Searchlores' phplab forum. Here you can see the initial ideas and inspirations
behind this project, the early attempts and design decisions.
www.searchlores.org
phplab's forum
Original thread
|
phproxomitron (18/01/04 17:39:32)
| |
just to note it somewhere, and see if some may be interested.
I think it was already suggested somewhere on this board: now that proxomitron is no longer supported (any changes on the ML?), it could be a nice idea to recode it, i mean most of its functionality, in php.
Packaged with a small webserver (nanoweb for ex) in an easy-to-handle installer, it would be a great tool.
The config could be done with a web interface, through an easy to understand interface (proxo was a bit confusing imho). And the php language would make it easy to hack it and add new modules, like plugging in a proxy retriever/tester. Perhaps it would provide a first 'kernel' to add new features to, and build a GPS - 'General Portal for Seekers'.
Add the scrolls and wands, add a mindweb-like utility, a bookmarks manager, and a bit of salt and pepper (ex: bookmarklets to easily inject information into its modules, a SOAP interface to interconnect our personal portals, etc.), and it will be the tool i/we dreamt of for years :)
These are the first thoughts. I'd like to have your opinions and suggestions on this project. I know some guys who will help to code it once the draft and specifications are done.
loki
|
Re: phproxomitron (18/01/04 19:33:34)
| |
funny you mention this, I started playing with the same idea yesterday: a php/web based proxomitron type application; I planned the start of this project after I updated the scrolls (Attn: Laurent my scroll visualise tool works great, even with all the color options :-)
I started from the idea like anonyweb, a webbased proxy filter you can configure (options stored locally in a cookie through a separate cookie file)
and use on any computer without installation. Configure has the option to load a config file from somewhere on the net (like where you save it) and create a config cookie on the pc you are using (eg public library, university...)
The config could be done with a web interface, through an easy to understand interface (proxo was a bit confusing imho).
I planned the interface to be something like the opera preferences...
more to come (sooner or later)
cinix
|
Re: Re: phproxomitron (19/01/04 01:46:22)
| |
| Funny you mention this, I started playing with the same idea yesterday: |
not my fault, they were talking about synchronicities :)
| I started from the idea like anonyweb, a webbased proxy filter you can configure |
What will be the scope of the configuration ? Do you plan to have all the features of proxomitron ? more ?
Do you need or want some help, of any sort ? I guess you have the anteriority :)
We can share any part of the project, be it modelling/design, implementation, documentation, gfx. As you wish.
What do you think about the GPS ? i'd like to collect all the features seekers would like to have in such a tool.
Loki
|
Re: Re: Re: phproxomitron (19/01/04 10:54:16)
| |
I must say, I haven't thought about it much yet. I planned to have most if not all features of proxomitron and then see what else could happen.
about anteriority: you don't have to wait for me. I'm currently involved in a "scrolls" project. It seems that you have it planned out a bit better. don't hesitate and start (maybe with a good concept outline). and tell me where I could be of help.
cinix
|
I'd like to have your opinion... sorry I can't resist :-). (19/01/04 23:55:45)
| |
Some ideas here ( brainstorming? ):
( Anyone want to make a block-diagram? )
1.) Shall we start writing code, or recycle existing code and work from existing modules?
( Different proxies/filters: ever seen Junkbuster's little child, Privoxy? It has a web frontend. Or "two" at sourceforge :-)? It is doing heavy filtering. )
2.) Header-filters: This is the warming-up part IMO. Once we know how to filter headers after checking the url against the lists, it is easier to move on to web-filters.
3.) Two ways to use the proxomitron web-filters: convert them before use to something chewable ( like "s/match/replace/" ), which is quick work ( I have something like this ), or write an internal parser. It would be interesting to measure this parser at work as a proxy, maybe serving several clients.
4.) My Proxo died with more than 32768 filters ( :-) ). Make it robust enough to handle even natural-language translations? Or do it through some chaining/threading? It may work in two steps with
a.) word-pairs and
b.) grammar-fixing
( Mine works with "a", and gives a good approximation, even in languages I can't read because of missing fonts. ) The lazier way is to reuse the SLS page and query through those servers. ( But if we think of independence, it is better to have our own code. )
5.) Normal sites spot automated-looking queries ( 10000/s ) and block the "attacker". If this is going to be a project for the public, it needs some resource-saving protection against abuse.
6.) Is there a chance to make it more powerful? I mean some smarter javascript-code parsing instead of just filtering it. ( There is a javascript-interpreter written in Java ). I don't know if it is worth it.
6.) IMO it needs a cache ( of the unfiltered copy ). When you test a filter you can't refresh the page 20 times: it wastes resources. ( Better still, it needs a cached copy of the filtered page too. )
7.) More fun:
a.) Logging to file ( I love logs )! ( Then feed your web-path with 'walrus' :-) )
b.) Filter-lists
Semi-automatic updating of the block/bypass/and other lists. ( Here is an example: my proxy copies all the files to paths named after their origin. Now if I check them for size, the ones with one tiny gif jump out: they are candidates for the next blocklist. But they can be read from a logfile too. ) Also maybe we already know sites where they have blocklists to download :-).
8.) More custom url-commands ( beware: I put a button on every page with "bweb..'thispage'". If I click on it, it goes out with the referer string. So the correct way is to catch and filter them before they go out ). Like one to check the url through google's "inurl", and another to do the same in archiveweb.
9.) Page-rewriter to wap-tools? Also, what to do with the things Proxo itself doesn't touch, but php can do wonders with? Text files, pdfs, docs, xmls.
There is that image-tinker tool someone mentioned, which catalogues/searches images by content? What about feeding all your incoming images to it? That would be a big fat proxy of yours/ours :-).
10.) Keyword-extractor as a kind of proxo-filter. Let's say you search a scientific territory from zero, and find the PAGE which looks like a perfect start. Now extract the "important" words from it and use them in the next turn?
11.) Webarchive-like url-rewriting for archiving purposes? Or any useful archiving technique. ( Again: check out "two". It fills a db with the distilled stuff. )
have
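The first route in point 3.) above (converting a filter into "s/match/replace/" form before use) can be sketched as follows. This is only an illustration in Python, not the converter mentioned in the post: it assumes that just two wildcards matter, `*` (any run of characters) and `?` (any single character), and treats everything else literally. The function names `proxo_to_regex` and `apply_filter` and the ad host in the usage note are made up for the example; the real Proxomitron matching language has many more constructs.

```python
import re


def proxo_to_regex(pattern: str) -> str:
    """Translate a tiny subset of a Proxomitron-style pattern to a regex.

    Assumption: only '*' and '?' are wildcards; all other characters
    are matched literally.
    """
    out = []
    for ch in pattern:
        if ch == "*":
            out.append(".*?")  # non-greedy, so one match stays short
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)


def apply_filter(match: str, replace: str, text: str) -> str:
    """Apply one match/replace pair, like sed's s/match/replace/g."""
    regex = proxo_to_regex(match)
    # A function replacement keeps backslashes in `replace` literal.
    return re.sub(regex, lambda m: replace, text,
                  flags=re.IGNORECASE | re.DOTALL)
```

For instance, `apply_filter('<img src="http://ads.example/*">', '', page)` would strip every image tag pointing at that hypothetical host from `page`.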
|
my opinion (20/01/04 13:32:18)
| |
i will now tell you that proxo can already do a lot of these things ;)
of course all of it in one program is nice. but remember cinix's latest essay,
we can already do a LOT of the things you ask for here..
| 3.) Two ways to use the proxomitron web-filters: convert them before use to something chewable ( like "s/match/replace/" ), which is quick work ( I have something like this ), or write an internal parser. It would be interesting to measure this parser at work as a proxy, maybe serving several clients. |
proxo has a special language (brrr parsers, studying that stuff right now,
exam tomorrow - yuck - most important thing i learned is that writing a
compiler/parser is HARD) anyway this language is a simplified variation
of regexps, better readable, just as powerful, and with special commands to
filter html specific things.
| 5.) Normal sites spot automated-looking queries ( 10000/s ) and block the "attacker". If this is going to be a project for the public, it needs some resource-saving protection against abuse. |
i thought it was meant to run on a local php server?
| 6.) Is there a chance to make it more powerful? I mean some smarter javascript-code parsing instead of just filtering it. ( There is a javascript-interpreter written in Java ). I don't know if it is worth it. |
proxo is already quite powerful like it is, but adding some real
imperative programming power to transform webpages will give it some
real interesting applications..
| 7.) More fun:
a.) Logging to file ( I love logs )! ( Then feed your web-path with 'walrus' :-) ) |
proxo can already do this. for my referer-checking experiment a few months
ago (which totally failed, as some of you may remember :) ) i made a proxo
header filter that logged every page i visited, plus a special code that also
appeared in my referer.
if you wish i can post the source of the filter how i did it.
| b.) Filter-lists
Semi-automatic updating of the block/bypass/and other lists. ( Here is an example: my proxy copies all the files to paths named after their origin. Now if I check them for size, the ones with one tiny gif jump out: they are candidates for the next blocklist. But they can be read from a logfile too. ) Also maybe we already know sites where they have blocklists to download :-). |
not sure what you exactly want to do here, but proxo also works with block-
lists.. check the bannerblaster filters etc.
| 8.) More custom url-commands ( beware: I put a button on every page with "bweb..'thispage'". If I click on it, it goes out with the referer string. So the correct way is to catch and filter them before they go out ). Like one to check the url through google's "inurl", and another to do the same in archiveweb. |
other ways are: javascript-bookmarklets to put as buttons on your Personal Bar,
or the custom protocol handler in cinix's essay sounds very promising as well :)
your other suggestions are also nice..
- ritz
ritz
|
Re: my opinion (20/01/04 23:36:19)
| |
Q:i thought it was meant to run on a local php server?
A:If the architecture ( like a php-filter/proxyserver ) is already able to work remote/online, it can't be wrong to plan against abuse.
Q:if you wish i can post the source of the filter how I did it.
A:Yes, thank you.
Q:Semi-automatic updating of the block/bypass/and other lists.
A:I mean you know how a HTTP round looks:
Client -- please gimme that page! - http://here/itis.htm ( host:here )
Server -- ok here it is.
Client( parsing ) -- and please - http://here/nice.gif - too ( host:here )
Server -- ok here it is.
And now from the same page:
Client -- Hey gimme http://bloodyads/ugly.gif ( host:bloodyads ) <--- Now, while we can catch them because they are "outer" links from "here" ( maybe Opera does this ), what I'm trying to say is that we can catch them by "pattern" too ( aaaaBaaaa :-) ), so maybe with a button/link click we can ask this proxo-thingy to:
- parse its logfile
- fish the badguys
- write/append the found strings to its blockfile ( even re-sort it ).
( -maybe reread the blockfile ). So the deal is the live update from its log.
But like you pointed out, maybe we are already able to do this.
Q:Proxo already knows it.
A:Yes but we move to "platform-independency".
Q:other ways are: javascript-bookmarklets to put as buttons on your Personal Bar
A:They are not so independent of the platform ( and I have had less luck using them :-) ). What I'm dreaming about is something which understands screwed-up webpages like a smart human with a smart browser, but runs on its own from the command line. Let's say our "phproxy" cleans up javascript links like a browser does ( from a browser you can 100% follow an obfuscated url ?! ); then you can chain it with any bot, and it really 'acts' like a human.
have
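The "parse the logfile, fish the badguys, append to the blockfile" round described above can be sketched like this. It is a minimal Python illustration under loud assumptions: the log format (one line per request, holding the URL of the page being viewed and the URL actually requested, separated by whitespace) is invented for the example, and so are the function names. A host that differs from the page's host is only a candidate for the blocklist, so a human should still review the result.

```python
from urllib.parse import urlsplit


def third_party_hosts(log_lines):
    """Return hosts that were requested from a page on a different host.

    Assumed log format (hypothetical): "<page url> <requested url>".
    """
    found = set()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip lines that do not fit the assumed format
        page_host = urlsplit(parts[0]).hostname
        request_host = urlsplit(parts[1]).hostname
        if page_host and request_host and page_host != request_host:
            found.add(request_host)
    return found


def merge_blocklist(existing, new_hosts):
    """Append new hosts to the blocklist; return it sorted, no duplicates."""
    return sorted(set(existing) | set(new_hosts))
```

Rereading the blockfile after the merge, as suggested in the post, would then pick up the new entries without restarting the proxy.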
|
Re: Re: my opinion (21/01/04 12:57:25)
| |
| Q:if you wish i can post the source of the filter how I did it.
A:Yes, thank you. |
ok maybe this evening.. i don't have much time :) i only have the
filter at home.
| Q:Semi-automatic updating of the block/bypass/and other lists.
A:I mean you know how a HTTP round looks:
Client -- please gimme that page! - http://here/itis.htm ( host:here )
Server -- ok here it is.
Client( parsing ) -- and please - http://here/nice.gif - too ( host:here )
Server -- ok here it is.
And now from the same page:
Client -- Hey gimme http://bloodyads/ugly.gif ( host:bloodyads ) <--- Now, while we can catch them because they are "outer" links from "here" ( maybe Opera does this ), what I'm trying to say is that we can catch them by "pattern" too ( aaaaBaaaa :-) ), so maybe with a button/link click we can ask this proxo-thingy to:
- parse its logfile
- fish the badguys
- write/append the found strings to its blockfile ( even re-sort it ).
( -maybe reread the blockfile ). So the deal is the live update from its log.
But like you pointed out, maybe we are already able to do this. |
so you mean logging the not-blocked
maybe-ads to a file.. then filtering out (by hand/script i presume)
the maybe-ads to produce new additions to the original blockfile?
yea if you can combine a maybe-ad filter (that doesn't filter anything)
together with the logging script (that i will post soon, but it's also
in the proxo-manual) you can already do this with proxo.
| Q:Proxo already knows it.
A:Yes but we move to "platform-independency". |
w00t! ;-)
- ritz
ritz
|
proxo logging (21/01/04 16:25:31)
| |
to make a list, you need to create one in the default.cfg file,
in the [Blocklists] section, add a line that looks like this:
List.SiteLog = "d:\projects\webmastertrapper\sitelog.txt"
in this case, the list is called SiteLog and it's
located at d:\projects\...
then to add something to the list, you make a filter that looks like this:
In = FALSE
Out = TRUE
Key = "User-Agent: mastertrapper (out) "
Match = "*"
Replace = "$ADDLST(SiteLog,$DTM(mHc) : \u)Qspeedbot/0.$DTM(mHc)
(compatible; http://qcrawl.4x2.net/botinfo.php?page=v$DTM(mHc) )"
what's it do? everytime you load a page (or an image, anything),
it adds a line in the Sitelog list that looks like this:
xxxxxxxx http://whereever.you.was.surfing
and changes your useragent to look like:
Qspeedbot/0.xxxxxxxx (compatible; http://qcrawl.4x2.net/botinfo.php?page=xxxxxxxx )
this was a little project of me a while back, to see if i could trap
referals from webmasters that would click through their (hopefully) online
logs. the xxxxxxxx is a unique code that changes with the time and the
connection number, and is also saved in the botinfo.php script (which to
the user just displays a very general explanation of what spiders do,
i copied from somewhere).
so the idea was that i could see from what unique code someone clicked
through on my 'explanation' page, and crosscheck that with my SiteLog.txt
to see what site it came from.
unfortunately it didn't work. the only hit i got was when i was surfing
a friend's private apache server with the filter on. he was of course
surprised how a 'bot' could have found his server ;)
this was after surfing for about two months with this filter ;)
it did prove to be useful to have a long list of all the urls i ever visited
ready for grepping though :) [as i delete my Opera history from time to time]
hey if anybody has any ideas where i might have gone wrong, or suggestions
for a better project, i'd love to hear about it :)
bye,
- ritz
ritz
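The crosscheck step described in the post above (take the unique code seen on the 'explanation' page and look it up in SiteLog.txt) could look roughly like this in Python. The sketch assumes each log line starts with the code and ends with the URL, which covers both the "code : url" form the filter writes and the "code url" form shown in the post; the function names and the sample codes in the usage note are hypothetical.

```python
def load_sitelog(lines):
    """Map each unique code to the URL that was loaded when it was logged.

    Assumption: the code is the first whitespace-separated field and
    the URL is the last one; anything in between (such as ':') is ignored.
    """
    visits = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # not a "code ... url" line
        visits[fields[0]] = fields[-1]
    return visits


def trace_referral(visits, code):
    """Return the URL visited under `code`, or None if it was never logged."""
    return visits.get(code)
```

With the log loaded once, each hit on the explanation page costs a single dictionary lookup, e.g. `trace_referral(load_sitelog(open_lines), "31140002")` for a made-up code.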
|
proxo logging - Thanks, its my fault (21/01/04 22:47:15)
| |
While your "project" is interesting in itself, the point I missed seems to be the line with the "$ADDLST" command, which I knew about but was too brainless to use
for logging. I wanna be a pensioner :-( . So thank you.
If you have a little time maybe check out "txt2regex.sourceforge.net".
"...With a simple interface, you just answer to questions and build your
own RegEx for a large variety of programs, like awk, ed, emacs, grep, perl, php, procmail, python, sed and vim. There are more than 20 supported programs. It's bash so download and run, no compilation needed."
It works on cygwin with bash >= 2.04.
It may help in a proxo-php regex converter.
have
|
Re: proxo logging (23/01/04 13:11:03)
| |
Phila
|
eh, no.. (23/01/04 14:14:31)
| |
proxo is short for the Proxomitron, a program that runs as a local proxy and
filters your http-traffic according to regexp-like user-defined filters.
it already implements a lot of the functionality discussed in this thread.
but proxo is not open-source, and its development is discontinued.. :(
- ritz
ritz
|
Re: phproxomitron (20/01/04 09:30:00)
| |
A nice name, especially to pronounce with a mouth full of dry biscuit :)
(and no cheating with that lame "pee-eitch-pee" pronunciation, no vowels means no vowels :))
Sorry, sorry, couldn't resist it :)
On the topic, thoughts:
1. We shouldn't be *writing* a proxy, just adding filtering capabilities! Research some good ready-made open-source php proxies. What are the differences between them? Which should we choose? If none is superior, and several are widely used, consider writing a wrapper layer, so the code will work with any of the wrapped proxies.
2. Consider combining multiproxy and proxomitron capabilities. Should this be done on the server side? Or should the client manually chain the phproxomitron with a standalone multiproxy (or another proxy of his choice).
3. That reminds me - chaining capabilities are a must!
4. Umm, is php the right tool/language? Does anyone have experience with how constantly running php scripts behave? Are they memory or processor heavy?
5. We should start work by *designing* the software, then coding it, esp. if there's going to be more than one developer! Else, the work will get chaotic and slow, and if no consistent coding standards are followed, the maintenance will be hard.
6. sourceforge vs other shared development environments? What are the alternatives?
Mordred
|
Re: Re: phproxomitron: some resources (20/01/04 11:07:27)
| |
http://php.justinvincent.com/home/articles.php?articleID=15
http://sourceforge.net/projects/php-proxy/
I had these already local, not tested yet.
cinix
|
Re: Re: Re: phproxomitron: some resources (20/01/04 15:04:54)
| |
which article are you referring to for the first url, cinix?
it seems to be invalid
edd
|
http://php.justinvincent.com/home/articles.php?articleId=15 (n/t) (20/01/04 16:18:27)
| |
Mordred
|
Reason for php pop3 proxy (20/01/04 20:14:35)
| |
Someone asked about continuously running php applications: I thought it could be of interest.
The other: just because it is somewhat older doesn't mean it is of no use here.
cinix
cinix
|
php proxies (20/01/04 15:17:55)
| |
in addition to cinix:
there were some threads in this forum about proxies and php.
* php and proxy chaining project
- http://www.2113.ch/phplab/mbs.php3?num=-1&thread=1040030839#1040030839
Maxpayne proposed a project and even the guy that created nanoweb (the php web server) posted a message.
This project deserves a look and it's abandoned (so it sounds good for this project)
* ANONWEB
- http://dividuum.de/p/anonweb/
> in action: http://www.autistici.org/
xcx
|
Re: PHProxies (20/01/04 17:11:59)
| |
(I'm afraid the links up to now are pretty useless - one is a pop3 proxy, another was last released in 2002 :( )
A quick search brought up these, which *seem* better, but someone has to d/l and try them (RL job prevents me from fully cooperating on this right now :( )
http://sbp.sufferingfools.net/
It is a 'browser proxy', not a proxy server, but I guess it won't be hard to convert, what d'you think?
Mordred
|
Re: Re: PHProxies (27/01/04 15:23:20)
| |
http://sbp.sufferingfools.net/
browser proxy : web layer (if there is such a layer)
if we want to have the http proxy functionality, we need to start one layer lower, with an http proxy. I had a look at the config file, and i think it will be dirtier to work from there. we would need to add a lot of functionality to update the config files. I think it is not a good base (but i'm not a good designer/coder) for working on the http headers.
loki
|
Re: Re: Re: PHProxies (27/01/04 15:38:58)
| |
http://lwest.free.fr/doc/php/lib/index.php3?page=net_http_client&lang=en
"Net_HTTP_Client is an almost complete HTTP Client"
humm. Maybe it will be possible to build the phproxo on what i called the 'web layer' (adding services to an http client - how to run it locally and make the 'true' clients connect through it ?)
loki
|
Re: Re: phproxomitron (27/01/04 15:13:10)
| |
| 1. We shouldn't be *writing* a proxy, just adding filtering capabilities! Research some good ready-made open-source php proxies. What are the differences between them? Which should we choose? If none is superior, and several are widely used, consider writing a wrapper layer, so the code will work with any of the wrapped proxies. |
Yeah, i agree on that. Now, time to test and review those scripts/libs
questions :
- usage : local / remote / both ?
- layer : http / web / both
- using proxomitron architecture and 'languages' ?
| 2. Consider combining multiproxy and proxomitron capabilities. Should this be done on the server side? Or should the client manually chain the phproxomitron with a standalone multiproxy (or another proxy of his choice). |
i think multiproxy (management of proxy lists, tests, rotations etc..) could be a separate module, but using the same kernel, to easily plug them together.
| 3. That reminds me - chaining capabilities are a must! |
create a protocol for 'inter-phproxo' communications ?
managing list of phpproxos willing to chain. But therefore
we have to handle trees, and loops :)
I think this idea can be an objective of the 'multiproxy project'.
| 4. Umm, is php the right tool/language? Does anyone have experience with how constantly running php scripts behave? Are they memory or processor heavy? |
Yup, i have a lot of friends running nanoweb as web server, and all seems to work good :)
You can check the performance tests here : http://nanoweb.si.kz/?p=perf
loki
|
phproxomitron answers - and more... (28/01/04 03:32:11)
| |
ANSWERS:
- usage : local / remote / both ? - BOTH
- layer : http / web / both - MUST BE BOTH
- using proxomitron architecture and 'languages' ? -
Architecture:
IF NOT 'USING', IT IS WORTH CHECKING THE AMOUNT/KIND OF parameters it is using. I repeat: if it does the same thing, one can easily write a converter script between a proxo-filter and the new one. It is only one more step to integrate this converter into the new project. Proxo-scripts are both human- and machine-readable, and this is an advantage too.
Language - regexes:
Let's check what the differences are between proxo and php regexes. Here we may need some work. ( I know there is more, but proxo is about perfect: if we want to use something like that, it is a MUST to implement the original's abilities. It doesn't matter what we call the functions, but we need about all of them. Maybe if it is too difficult, we can go back to Proxo Naoko3.*, which was a little simpler. )
Consider combining multiproxy... I think it needs only an HTTP header, and parsing of a proxy-list. ( Using Ritz's idea we can update this kind of thing in about real time. So find a new proxy, feed it to your list. Also, if someone cares, find some random scheme for rotating them. ) ( Sorry, I also don't see much profit in paranoid proxying generally. )
create a protocol for 'inter-phproxo' communications ? ( I think this can go through the SAME new url-commands I dreamed of, or better, header-commands. My new idea is stego: some kind of pseudo-random session-like strings, or better, just hide them in some of the headers, like an extra "Content-Length: :0121" <--- we catch the malformed header with the command, but no one else cares about it ( as I remember, even an extra whitespace is enough ). And headers are maybe not logged so heavily. ( Another: play with the header ORDER. I see 8 strings, so we have some chance to mix them - feel it? :-) ) )
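The "play with the header ORDER" idea can be made concrete: 8 distinct headers can be arranged in 8! = 40320 ways, so the order alone can carry a number between 0 and 40319. A minimal Python sketch of that encoding, assuming both ends agree on the same set of header names and that nothing in between reorders them (which is not guaranteed on a real network path); the function names are made up for the example.

```python
import math


def encode_order(headers, value):
    """Order `headers` so that the permutation itself encodes `value`.

    Assumption: header names are distinct and known to both ends.
    """
    pool = sorted(headers)
    if not 0 <= value < math.factorial(len(pool)):
        raise ValueError("value does not fit in this many headers")
    ordered = []
    for remaining in range(len(pool), 0, -1):
        # Each position is one digit in the factorial number system.
        index, value = divmod(value, math.factorial(remaining - 1))
        ordered.append(pool.pop(index))
    return ordered


def decode_order(ordered):
    """Recover the number hidden in the order of the header names."""
    pool = sorted(ordered)
    value = 0
    for remaining, name in zip(range(len(pool), 0, -1), ordered):
        index = pool.index(name)
        value += index * math.factorial(remaining - 1)
        pool.pop(index)
    return value
```

A chained phproxo would call `decode_order` on the names of the incoming headers and treat the resulting number as a command code, while an ordinary server just sees a normal request.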
managing list of phpproxos willing to chain. But therefore
we have to handle trees, and loop... ( My opinion - this one later )
Umm, is php the right tool/language? - If you have some local cache, you can chain nanoweb with it, then query it with wget. I think for personal use it is good enough. For remote use I think there won't be too many users/filters ( like you set up one for yourself, then tell some friends about it: most likely 2-3 people use it at different times, and not with the most difficult filters or the largest number of them ).
Maybe it needs a sparing mode in which it works with its default settings and doesn't WRITE/UPDATE files?
The web-admin thing may be switchable through the url/header commands. Even have the same commands through the command line ( and/or headers/urls ) and the "GUI".
I think I'll start to work on this for fun. My choice is nanoweb, which according to the developers' page is ALREADY proxy-capable ( I think it is only relaying ). At the time of my post their site is down, but I found a mirror at www.arte.unipi.it/Public/misc/ filled with other interesting stuff ( a goldmine actually, with more scripts to work with/learn from ).
Also it is worth deciding what ( name/system ) to call this. phproxo is good for us, but if this project goes outside, maybe we raise some enemies against us/our program ( and they start to learn/block it ). Personal filtering is ok, but I saw places where they talked about copyright infringement ( you "modify" the content :-) ). So do we play this with shield & sword, or stealth? If stealth, let's call it a "module for nanoweb" and hide behind it. For me it is good either way :-)!
And one more, In my Flash MX 2004 bloat-bag I found this gem:
JavaScript_1_5.zip 884,358 JavaScript Interpreter ( c sourcecode )
..."JSRef builds a library or DLL containing the JavaScript runtime (compiler, interpreter, decompiler, garbage collector, atom manager, standard classes). It then compiles a small "shell" program and links that with the library to make an interpreter that can be used interactively and with test .js files to run scripts. The code has no dependencies on the Navigator code."
So maybe this is the possible extension to any of the filtering projects?
have
|
Re: phproxomitron answers - and more... (28/01/04 09:14:09)
| |
hi
i'm the author of the mod_proxy for nanoweb
i'm also interested by this project :)
franck
|
maybe (28/01/04 22:32:52)
| |
is it maybe possible to hack proxo into doing what we want?
[let's first not think about the ethics of this, i can make up an excuse
anytime you want]
on the other hand, maybe i still don't quite understand what this
tool is exactly going to do..
anybody care to draw up a design or something?
so it's php, proxy, proxomitron, with a webserver..
so euhm, the webserver is running - serving what kind of websites?
or is the webserver implementing a proxy? do you need a webserver to run
a proxy?
then where does the php come in?
i can spew all kinds of interesting ideas, but if i don't know what we're
talking about they may not be very constructive :)
please somebody be concrete and exact to me :)
- ritz
ritz
|
Summary. ( about 9 bigger different task ) (29/01/04 02:26:22)
| |
So if you READ the thread you can summarize this way:
We can force Proxomitron to do a lot of things,
but we would have to RE it, and program/hack in its native language ( C? ). I
don't want to. I don't feel any moral problem with this one - Scott left
his grown-up kid. What we need is:
1.) A real proxy server which is:
2.) Platform independent.
3.) Open-source in the original and the phplab meaning.
Its code is both reachable for all, and made for reading/learning.
Also:
4.) Easily extendable to be:
a.) useful for remote proxy too
b.) functioning not only like a filter, but have caching functions, then later
c.) have web-archive like functions ( url-rewriting to YYYYMMDD style directories maybe )
d.) possibly have an own communication form of some ( like proxo's
url commands, but maybe a bit more "hidden". )
e.) chainable!
At present I can say that we may not need a separate 'web server' and
'proxy script', because we have a good candidate to tinker with. It is
called nanoweb, and it is a php-script ( with some php-dependencies ),
which IS a web server, and already a basic ( relaying ) proxy. You only
need a php interpreter ( see Note 1 ) to
use it ( on top of the hardware requirements, of course - net
connection ) as BOTH a web server and a proxy. ( See APPENDIX also )
What we need to do the job:
1. IT WOULD BE BEST IF SOMEONE NICELY COMMENTED THE CODE,
FOR BETTER LEARNING PURPOSES.
2. Some coders for the http layer ( or maybe it is ready ).
( code to work from - nanoweb - http://nanoweb.si.kz/ )
3. At this level we have the HEADER filters. A good start is to implement
a.) the filter file
b.) the filtering function
4. Implementing the html parsing, filtering.
Example code to work from:
www.arte.unipi.it/Public/misc/phphtmlparser_1_0.tar.gz
( And also this phplab )
5. Implement the filter-config file.
Example code to work from:
proxo's own primarily if possible
( nanoweb already opens/reads its config-files, so I think it would not be
too difficult to make it open one more )
6. Check the difference between proxo and php regexes ( and rules ).
7. If it is needed, write converter script ( possibly bidirectional ).
8. Go after general problems and write filters to them -( Proxo-filter writers )
Like MIME fixing:
Example code to work from:
www.arte.unipi.it/Public/misc/mime_lookup.php.gz
Or write/dig up and implement - html2wml ( your own wap-gateway ), xml2html, and
of course natural language filters.
9. Now comes another big part - caching, rewriting, organizing semi-dynamic content like mbs:
http://www.2113.ch/phplab/mbs.php3/mb001?thread=1075106564&num=1075306410
|
...I was hoping that I could find the post with the forum's search engine, but since it failed, I had to devise another method of reaching it...
...The only solution I found was brute-force: download the entire forum (the OpenSwf section of FlashKit took 100M :) and grep from the local machine.
The whole matter is nothing much really, just a demonstration of the idea that it's useful to download entire MBS-s. I had even started to design a script system for downloading known forums (i.e. any yahoo group, any phpbb, any vbulletin etc.), but later decided against it - it's not needed that often, and you can do it 'by hand' quickly enough (iirc, I used GetRight's browser, because of its great feature to sort links in the loaded html - by type, size, address, etc.).
I've posted a script once at the php board, that could download Laurent's mbs (i.e. this one, php, ebmb), but it can be replaced by a single wget command. Well, the script behaved better - it decided whether to download a thread by the date of the last post, while wget has to make a request to the server, but that has meaning only for incremental updates. If someone's interested - it's at the php, no updates since.
|
So there are bigger tasks than everyday browsing. What about
rewriting this very messageboard's 'dynamic' urls, and remapping them to
our local cache? That way we have both the info-path to the original, and the
old content in static form. Ritz? A proxo-filter for rewriting the urls,
translating between the client and the boards :-)? And it may also
save time/space to put the local copies into a database - MySQL?
Now go to two.sourceforge.net and download/read the documentation.
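Remapping a 'dynamic' messageboard url to a static place in a local cache, combined with the YYYYMMDD-style directories from point 4.c, might be sketched as below. This is an illustrative Python fragment only: the directory layout, the use of a short hash of the query string in the file name, and the name `cache_path` are all assumptions made for the example, not part of any existing tool.

```python
import hashlib
from datetime import date
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def cache_path(url, fetched_on, root="cache"):
    """Map a (possibly dynamic) URL to a static path in a local archive.

    Assumed layout: <root>/<YYYYMMDD>/<host>/<name>.<query hash>.html
    """
    parts = urlsplit(url)
    stamp = fetched_on.strftime("%Y%m%d")
    # Dynamic URLs often differ only in their query string,
    # so a short hash of it keeps the file names apart.
    digest = hashlib.sha1(parts.query.encode("utf-8")).hexdigest()[:12]
    name = PurePosixPath(parts.path).name or "index"
    host = parts.hostname or "unknown"
    return str(PurePosixPath(root, stamp, host, f"{name}.{digest}.html"))
```

Because the mapping is deterministic, the same function can serve both directions of the idea: the proxy uses it to store a fetched thread, and a url-rewriting filter uses it to point links at the stored copy.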
Note 1
Actually you need the "php_sockets.dll" extension also. And you must
play with the setup :-) if yours is not the default type.
APPENDIX A. PROXIES
WE CAN SAY THAT A PROXY IS A WEB SERVER WHICH SERVES THE CONTENTS OF OTHER MACHINES.
FIGURE 1. RELAYING
 _________            _________             __________
!         ! 1 QUERY  !  PROXY  !  2 QUERY  !  REMOTE  !
! CLIENT  !----------!WEBSERVER!-----------!  SERVER  !
!_________! 4 ANSWER !_________! 3 ANSWER  !__________!
( Original purpose - time saving. Good for us if remote - already implemented ! )
A CACHING PROXY SAVES THE CONTENT OF A REMOTE ANSWER, AND IF RE-QUERIED,
SERVES IT FROM ITS CACHE - SAVING BOTH TIME AND MONEY.
FIGURE 2. CACHING PROXY HAS THE QUERIED DOCUMENT:
 _________            _________             __________
!         ! 1 QUERY  ! CACHING !   NOONE   !          !
!  LOCAL  !----------!  PROXY  !-----------!  REMOTE  !
!_________! 2 ANSWER !_________! CARES NOW !__________!
( Great time and bandwidth saving. We need flushing/refreshing for this )
3. FILTERING PROXY
_________ _________ 2.FILTERED __________
! ! 1 QUERY !FILTERING! QUERY ! !
! LOCAL !-----------! PROXY !-----------! REMOTE !
!_________!4. FILTERED!_________!3. ANSWER !__________!
ANSWER
Let me show you an example: You have the local copy of the rfc-s. So you
don't want to even link to remote ones. Now this is how I did it with
Proxo. ( Badly written filter )
Name = "RFC to local link"
Active = FALSE
URL = "*"
Bounds = "rfc*"
Limit = 256
Match = "( rfc[#0:5000])\1(.txt|.html|.pdf)\2"
Replace = "<img src="http://local.ptron/local.gif" width="45" height="16"
alt="we have a local copy of \1\2 (html)">"
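For comparison, here is a minimal sketch of how the same filter might look as a PCRE callback in PHP. It is written for a current PHP rather than 4.3, the function name rfcToLocalMarker is hypothetical, and it mirrors the Proxo filter above (including its habit of replacing the matched name with the marker image), so treat it as an illustration and not as the project's filter code.

```php
<?php
// Sketch of the "RFC to local link" filter as a PCRE callback.
// The image URL is the one from the Proxo filter; the function name is hypothetical.
function rfcToLocalMarker(string $sBody): string
{
    // "rfc" + up to four digits + one of the three extensions, case-insensitive.
    $sPattern = '/\brfc(\d{1,4})\.(txt|html|pdf)\b/i';

    return preg_replace_callback($sPattern, function (array $aMatch): string {
        // Mirror the [#0:5000] numeric range test of the original filter.
        if ((int) $aMatch[1] > 5000) {
            return $aMatch[0];   // out of range: leave the text untouched
        }
        $sName = htmlspecialchars($aMatch[0]);
        return '<img src="http://local.ptron/local.gif" width="45" height="16"'
             . ' alt="we have a local copy of ' . $sName . '">';
    }, $sBody);
}

echo rfcToLocalMarker('see rfc2616.txt and rfc9999.txt'), "\n";
```

The /i modifier is what gives the case-insensitive matching that a Proxo-compatible filter would need.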
have
|
Re: phproxomitron (28/01/04 20:17:34)
| |
To support this very interesting project, I would be pleased to open a dedicated board (or pouche, if you think it's better) on phplab, where the people interested in it (I see there are many) could talk freely about it.
Let me know if you think this is a good idea.
Laurent
|
a good idea... at least as memento mori (28/01/04 23:34:01)
| |
indeed, it would be a good idea to open an ad hoc pouche
Let's hope this develops into solutions. Lately all our nice projects seem to go astray instead of snowballing :-(
As we have already seen more than once, alas, when the time cometh to transform projects into deeds, people & friends & seekers & whatnot suddenly (tend to) disappear "dans la nature".
But a pouche would be a memento mori of sort, at least :-)
Go ahead, if you find the time.
fravia+
|
Thank you - an idea/sketch (29/01/04 03:23:06)
| |
What about forcing people to do a bit more pre-classifying for the good of all?
What I thought is maybe a preface-like header ( on the TOP OF THE PAGE ) with yours:
1. avoid going off-topic, use the messageboards for that
2. try to organize your comments by using ul / li tags. Highlight terms with u, b, i tags.
and:
3. Please try to classify, and sign your post by its nature too:
1.a developing the client-server-server communication part
1.b knows implementable code examples
2.a developing HEADER filters
2.b or know some about it
3.a developing the content filter-parser part
3.b knows code examples
4.a developing filters ( what kind? )
4.b or just find some
5.a developing caching/mirroring functions
Possible name-caching together with the blocklist(s)
5.b knows code examples ( alternative solutions )
6.a big stuff ( MySQL - mbs, semi-local SE-s )
6.b and its already existing code examples
( If we have a cleanly identifiable class for a thread, we can make nice rows in different colors which show what part is moving and what is standing still, re-parse the pouche into links/tools/deadends, whatever, and more... )
And please keep the name pending for a little longer. Call this "php-filtering board ( or pouche )". Read my boggling about it if you want. We/I can migrate the essence of this thread there too, for a start.
have
|
pouche created (29/01/04 21:16:57)
| |
Laurent
|
Design proposal 0.1 (29/01/04 16:17:03)
| |
========================[ PHProxo Design document]============================
Version 0.1
-------[ Overview ]-
PHProxo is an http proxy server implemented in php, whose basic purpose is rewriting of input and output data in an http session, similar to the (now discontinued) Proxomitron local proxy. Its modular design allows other functionality to be plugged in the process of passing and receiving information.
-------[ Structure ]-
1. [SM] Server module -- Accepts HTTP requests from the clients, returns (possibly modified) responses from the target server
2. [CM] Client module -- Forwards (possibly modified) HTTP requests from the client to the server and receives its response
[Note: these are 'input' and 'output' from the client's viewpoint]
3. [QM] Re_q_uest plugins - Operate on client's requests
4. [PM] Res_p_onse plugins - Operate on server's responses
Data flow of a normal transaction:
----[ User ]-------------------[ Proxy ]----------------------[ Remote server ]----
1. Browser: HTTP request
                               2. [SM] accept request
                               3. [QM] Process request with output plugins
                               4. [CM] issue modified request to remote server
                                                              5. Modified HTTP request
                                                              6. Answer
                               7. [CM] accept answer from remote server
                               8. [PM] Process answer with input plugins
                               9. [SM] forward modified answer to client
10. Client's browser receives html/error code
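The proxy-side steps (2 to 9) can be sketched as one function in which every stage is a plain callable. This is only an illustration of the flow, written for a current PHP: HandleTransaction and the stand-in plugins are hypothetical names, and the "fetch" callable fakes the remote server so no network is involved.

```php
<?php
// Hypothetical sketch of steps 2-9 of the data flow; every stage is a callable.
function HandleTransaction(string $sRequest, array $aRequestPlugins,
                           callable $fnFetch, array $aResponsePlugins): string
{
    // 3. [QM] let each request plugin rewrite the request in turn
    foreach ($aRequestPlugins as $fnPlugin) {
        $sRequest = $fnPlugin($sRequest);
    }
    // 4.-7. [CM] send the modified request and take the answer
    $sAnswer = $fnFetch($sRequest);
    // 8. [PM] let each response plugin rewrite the answer in turn
    foreach ($aResponsePlugins as $fnPlugin) {
        $sAnswer = $fnPlugin($sAnswer);
    }
    // 9. [SM] the caller forwards this to the browser
    return $sAnswer;
}

// Usage with stand-ins: strip the Referer header, fake a server, strip <blink>.
$sResult = HandleTransaction(
    "GET / HTTP/1.0\r\nReferer: http://example.com/\r\n\r\n",
    [function ($s) { return preg_replace('/^Referer:.*\r\n/mi', '', $s); }],
    function ($s) { return '<html><blink>got: ' . strlen($s) . ' bytes</blink></html>'; },
    [function ($s) { return str_replace(['<blink>', '</blink>'], '', $s); }]
);
echo $sResult, "\n";
```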
------[ Server module ]-
Accepts HTTP requests from the clients, returns (possibly modified) responses from the target server.
Features:
1. Process multiple requests simultaneously
2. Logging
3. POST-ed data
4. Password authentication
5. IP Accept/Deny configuration system (only local or accept limited outside connections, or accept all outside connections)
6. Cache system
Possible codebase: nanoweb
http://nanoweb.si.kz/
------[ Client module ]-
Forwards (possibly modified) HTTP requests from the client to the server and receives its response.
Features:
1. Issue multiple requests simultaneously
2. Persistent connections
3. POST-ing data
4. Connect through proxy (thus we can chain ours to other proxies :)
Possible codebase:
http://nanoweb.si.kz/
http://sbp.sufferingfools.net/
http://sourceforge.net/projects/php-proxy/
http://snoopy.sourceforge.net/
http://anton.concord.ru/ ('street' html parser)
------[ Request/response plugins ]-
The system provides a plugin interface to allow modifying of the request headers and response headers/body. On each incoming request/response it enumerates the available plugins (possibly checking with some configuration file which of these are active or not)
//example:
class CRequestPlugin {
    //override this to modify request headers
    function Process(&$sRequest) {
        return $sRequest;
    }
};
class CResponsePlugin {
    //override this to modify response headers
    function ProcessHeaders(&$sResponseHeaders) {
        return $sResponseHeaders;
    }
    //override this to modify response body
    function ProcessBody(&$sResponseBody) {
        return $sResponseBody;
    }
};
This flexible plugin system allows one to write for example:
- i/o Plugins for Proxomitron-compatible RE filters
- preg_replace()-based filters
- ereg_replace()-based filters
- custom code filters (i.e. remove all images with proportions 7:1)
- parse HTML and work on the DOM tree
It may be reasonable to allow the filtering process to issue more requests to the server (a stupid example: check if a file link in the response page is available for download and add its size to the response page)
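As an illustration of the preg_replace()-based variant, here is a minimal sketch of a response plugin built on the CResponsePlugin class from the example. CPregFilterPlugin and its rule list are hypothetical, and the code assumes a current PHP (constructors and visibility keywords did not exist in this form in 4.3).

```php
<?php
// Base class as in the design document (pass-through by default).
class CResponsePlugin {
    function ProcessHeaders(&$sResponseHeaders) { return $sResponseHeaders; }
    function ProcessBody(&$sResponseBody) { return $sResponseBody; }
}

// Hypothetical plugin: applies a list of pattern => replacement pairs to the body.
class CPregFilterPlugin extends CResponsePlugin {
    private $aRules;

    function __construct(array $aRules) {
        $this->aRules = $aRules;
    }

    function ProcessBody(&$sResponseBody) {
        foreach ($this->aRules as $sPattern => $sReplacement) {
            $sResponseBody = preg_replace($sPattern, $sReplacement, $sResponseBody);
        }
        return $sResponseBody;
    }
}

$pPlugin = new CPregFilterPlugin([
    '/<script\b.*?<\/script>/is' => '',   // drop script blocks
    '/\s+onload="[^"]*"/i'       => '',   // drop onload handlers
]);
$sBody = '<body onload="popup()"><SCRIPT>evil()</SCRIPT><p>text</p></body>';
echo $pPlugin->ProcessBody($sBody), "\n";
```

Because the rules are data, the same class could be fed from a configuration file that says which filters are active.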
Mordred
|
Server module: Proof of concept (30/01/04 12:54:57)
| |
warning: This is extremely ugly code that I copied from
http://www.zend.com/zend/tut/tutorial-staub3.php
and then kicked in the teeth until it started behaving like a proxy
The purpose of it is to test if it works on different platforms, or what problems arise if not. Note that it outputs various notice and warning messages, ignore them for now. Under win32 you need php_sockets.dll module enabled.
Usage:
Tell your browser to use proxy on localhost:9000, then run this code. I found it easier if I run this from opera, while setting IE to use the proxy. The script should terminate after 30 seconds, if not - kill manually.
My platform - XP, php 4.3.4
(lotta source here, removed)
Mordred
|
Re: Server module: real code candidate (30/01/04 16:44:25)
| |
Rewrote this from scratch in a more as-god-intended-it-to-be manner.
Again, test in your place, and this time check the code also:
are the class interfaces okay, is the naming convention okay, are the right socket functions called with the right params?
(lotta source here, removed)
Mordred
|
real code candidate - dumb notes ( nanoweb proxy too ) (31/01/04 02:29:15)
| |
Just because everybody doin' somethin, I can tell you my ( thin/dumb ) findings.
Tried your first code, it was able to write 1,6M error-log under 30 sec! Tried your second code - it did something, but please don't wait 'till I clearly understand the whole thing :-) - just keep working. As I see it, it is standalone code without any extra server-layer? Good for us.
I have successfully kicked nanoweb's mod_proxy into working; my url is http://localhost/http://www.2113.ch/phplab/mbs.php3?num=1075477465&thread=1074443972 !
So its caching is plain nice, but it makes only root dirs, so it needs some recursing code ( this file is "%2Fphplab%2Fmbs.php3" instead of mbs.php3 in the /phplab/ subdir ). As I'm the archiving type, I may even eliminate the cache-flushing code ( if I want to delete something, I'll do it myself ).
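A minimal sketch of the kind of mapping that would give nested cache directories instead of "%2F" file names. UrlToCachePath and the cache root are hypothetical, this is not how nanoweb's mod_proxy does it, and the code assumes a current PHP.

```php
<?php
// Hypothetical helper: map a URL to a nested path below the cache root.
function UrlToCachePath(string $sCacheRoot, string $sUrl): string
{
    $aParts = parse_url($sUrl);
    $sHost  = $aParts['host'] ?? 'unknown-host';
    $sPath  = $aParts['path'] ?? '/';

    // A trailing slash means "directory index"; give it a file name.
    if (substr($sPath, -1) === '/') {
        $sPath .= 'index.html';
    }
    // Encode each segment on its own so "/" keeps separating directories,
    // and drop "." / ".." so nothing can escape the cache root.
    $aSegments = [];
    foreach (explode('/', $sPath) as $sSegment) {
        if ($sSegment === '' || $sSegment === '.' || $sSegment === '..') {
            continue;
        }
        $aSegments[] = rawurlencode($sSegment);
    }
    $sFile = implode('/', $aSegments);
    // Keep dynamic pages apart: the query string becomes part of the file name.
    if (isset($aParts['query'])) {
        $sFile .= '%3F' . rawurlencode($aParts['query']);
    }
    return rtrim($sCacheRoot, '/') . '/' . $sHost . '/' . $sFile;
}

echo UrlToCachePath('/var/cache/phproxo',
    'http://www.2113.ch/phplab/mbs.php3?num=1075477465&thread=1074443972'), "\n";
```

Before writing the file, the missing directories can be created in one go with mkdir(dirname($sPath), 0777, true).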
Dumb advice for dumb people: you need different IP addresses (if no port is specified?) for your mod_proxy and your client. Now my proxy allows connections from 127.0.0.1 and IT IS serving from 127.0.0.0.
have
|
etc. & nanoweb ;) (31/01/04 11:25:27)
| |
etc.:
the first code is NOT *my* code (shiver), even code I write strictly for my own use is incomparably more beautiful :) I wanted only to see how (and if) it would work on other machines, I just made it compile and run (yes, it didn't even compile) and then behave like a proxy that can really be hooked to a browser (i.e. wait for \r\n\r\n and then answer).
Then it happened that I had some free time, and I rewrote this in a more proper way, not to mention actually *working* - the original code, apart from generating errors like mad, barely checked return conditions, and where it DID check, it did so wrongly (if ($input == null) ,my ass!).
nanoweb:
I don't like the code (erm, why do I put this in first place ... okay forget it), and from what I've read and understood it is not well suited for our purposes. It relies on forking, and boldly makes blocking calls, which means that on Windows every url fetch will be a blocking one.
*sigh* as it seems we WILL have to write our own proxy server :(
Mordred
|
Mordred, this stuff simply screams... (31/01/04 14:16:36)
| |
...please transform me into an essay!
Ya know, friends, history is unforgiving for little boys that do not leave crumbles when they venture into their dark forests (although, come to think of it, little birds will steal the crumbles anyway :-(
But you (should) get a fuzzy warm feeling just coz you'r leaving your crumbles behind.
So Mordred, Sir :-)
Please be so kind and put everything down so that even Joe the young surfer, your future friend and helper (and historian?), will be able to follow your interesting, clever paths.
fravia+
|
ETC. & (nanoweb) ;-), libcurl? Cygwin? (31/01/04 14:25:54)
| |
*sigh* as it seems we WILL have to write our own proxy server :(
May we never have a bigger problem :-)!
Please ignore the dumb questions, but answer the interesting ones!
As I see it, we can still use the given code ( Nanoweb, sbp ) as a cheat-sheet for what to do, and what not. Also I want to extract the info from Proxomitron's what's new file about its solved errors, to see if there is something to learn from.
NEED A LIST OF...
We must know if our proxy needs any (special) config of php ( check out my table on the other side - we must declare case-insensitive regex support to be compatible with proxo ). Also whether there are version dependencies.
Also worth mentioning: according to their board/manual, Nanoweb ( PHP ) is able to fork on MSWin with Cygwin - do we need Cygwin-PHP to do the same? Then declare it for the users/in the manual, to be platform-independently correct.
"because current PHP.exe versions do not provide a wrapper to the POSIX fork() system call (which exists as a variant at least for NT). If you know how to do, you could compile PHP yourself using the Cygwin GCC to get a fully working version."( there are precompiled binaries out there of course )
I saw a php_curl.dll ( libcurl 7.10 ) in my PHP extensions dir. Now curl is about as smart as wget or better, and well documented ( except they didn't build it so bot/massdownload-like, but this is not SO big a problem ). Building the code around it, using it for outgoing requests, would save a lot of coding?
"2. [CM] Client module -- Forwards (possibly modified) HTTP requests from the client to the server and receives its response "
"libcurl is a free and easy-to-use client-side URL transfer library, supporting FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP. libcurl supports HTTPS certificates, HTTP POST, HTTP PUT, FTP uploading, kerberos, HTTP form based upload, proxies, cookies, user+password authentication, file transfer resume, http proxy tunneling and more!"
I try to collect the variables needed for later install/setup and configuration utilities - foolproofing. Also, by looking at the nanoconfig script, I may be able to do something similar for our project ( the config may be a separate script? ).
I also want to make a table of Proxomitron's 'commands', and maybe possible directions to their php equivalents.
It looks like PHP is bloody strong, and has a lot of existing code to simply apply ( like html2xml convert - wap gateway ) to do about every imaginable task ( and some we are even unable to imagine ). Like PHP regexes can match against hex codes - binaries! On-the-fly patching :-)? Just wonderful.
More about your code ( possibly dumb questions ):
$sOutput is what our client gets by default.
In place of $sOutput we may have separate variables for each header string ( or is it collecting them from somewhere else? This is the collector of our header filters' output ).
*Buffers - what I know about them is that they overflow.
Any ideas about exploits?
Defenses - filter for content & length?
( On the other side, possible offensive/hostile functions against
really bad guys/spamservers - extra size/content headers/data? )
Does it need to be changed to accept ranges(*) in the client IP address/port setting?
Do we need this function, or do we stay more secure without it?
$nMaxClients=10 - if we want to stay 'secure', isn't it better to leave this around 1 by default? If our client downloads a page in 3-4 threads, does that mean 3-4 clients?
Continue broken downloads and/or save broken stuff - if we are lazy/don't want too much coding, check out existing solutions ( my present cache DELETEs them, which we don't need - half a file is better than nothing ).
So IMO it must write the data from the net/from its buffer to disk if the connection dies.
Also ( if the cache and proxy are built together ) there must be a setting to plug any filter into the cache's outer side ( most headers and 100% trusted web-filters ) or inner side ( experimental filters? ) ( we don't want accidentally overfiltered trash in our cache - for those it is enough if we see them in our client's output ).
have
|
What do you plan about the non-blocking bug on win32? (n/t) (26/03/04 17:50:19)
| |
Kriton
|
Re: What do you plan about the non-blocking bug on win32? (26/03/04 18:47:43)
| |
What exactly do you mean?
phproxy (or Philtron or whatever) uses a polling approach:
Everything runs in a single thread, and from time to time the sockets are polled to see if they are ready for reading or writing, and the corresponding action is carried out.
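A minimal sketch of one pass of such a polling loop. Note the assumptions: PollOnce is a hypothetical name, the code uses PHP's stream functions (stream_select) as a stand-in for the socket_select() call of the sockets extension, and the demo uses a local socket pair in place of real browser connections. It is not the phproxy source.

```php
<?php
// Sketch only: one pass of a polling loop over plain PHP streams.
// PollOnce() is a hypothetical name; stream_select() stands in for socket_select().
function PollOnce(array $aStreams, int $nTimeoutSec = 0): array
{
    if (!$aStreams) {
        return [];
    }
    $aRead   = $aStreams;   // stream_select() narrows this to the ready streams
    $aWrite  = null;
    $aExcept = null;
    $nReady  = stream_select($aRead, $aWrite, $aExcept, $nTimeoutSec);
    if ($nReady === false || $nReady === 0) {
        return [];          // error or nothing ready: the loop just moves on
    }
    $aData = [];
    foreach ($aRead as $pStream) {
        $aData[] = fread($pStream, 8192);   // read only where data is waiting
    }
    return $aData;
}

// Demo with a connected pair of local sockets (STREAM_PF_UNIX works on Unix-like systems).
[$pA, $pB] = stream_socket_pair(STREAM_PF_UNIX, STREAM_SOCK_STREAM, STREAM_IPPROTO_IP);
fwrite($pA, "GET / HTTP/1.0\r\n\r\n");
print_r(PollOnce([$pB]));
```

A real server would call something like this repeatedly, with the listening socket and all client sockets in the list, and would poll the write side in the same way.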
HTH, if not feel free to ask more :)
Mordred
|
Re: Re: Server module: real code candidate (31/01/04 11:21:29)
| |
Works fine under linux (SuSE 8.3) / php 4.3.4 / Netscape browser configured with proxy = localhost:9000
Where do you plan to insert the process for fetching the requested page (and the eventual filters)? At first I would say in CClient, but this doesn't fit with the socket_select() blocking call, so I believe it has to be handled by CServer:Loop() and use the same socket mechanism (but as client this time, not server).
Maybe the name CClient is confusing, as it may be understood as the client side of the proxy (when the proxy connects to the remote server). May I suggest something like 'CRemoteClient'?
Just my 2 cents. Keep up the great work !
Laurent
|
Some more code + answers (31/01/04 17:20:16)
| |
The source got somewhat big, so I will not post it here.
Get it here:
http://smrt.host.sk/server.php.txt
What's new:
A prototype interface for CFetcher-s (that's the [CM] client module in my silly design doc). I know it's not right yet, but one has to step on several hoes before getting it right :/
The idea of having an abstract 'fetcher' is this: you fancy libcurl over Snoopy? No problem - write your CLibCurlFetcher wrapper and it plugs seamlessly into the system. Right now there are two fetchers. The first is CFileFetcher, which downloads files through the PHP fopen_wrappers (i.e. saying fopen('http://www.google.com','r')). I changed things a bit and it won't work right now (not without some persuasion at least). (Btw I don't know if libcurl can be used here - not if it cannot be persuaded to run without blocking)
The other is a CBlockingFetcher, which 'blocks' when ordered to download a file, i.e. it waits for the transfer to finish before returning execution to the script. Again, it is only for testing purposes, the real fetchers cannot behave like that, but will have to be able to download concurrently.
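A minimal sketch of what such a pluggable fetcher abstraction could look like. The interface name IFetcher, the Fetch() method, CCannedFetcher and ServeUrl are assumptions made for illustration (the real server.php.txt is not reproduced here), and the code targets a current PHP.

```php
<?php
// Hypothetical shape of the fetcher abstraction; names are assumptions.
interface IFetcher {
    // Returns the document body, or false on failure.
    public function Fetch(string $sUrl);
}

// Blocking fetcher: waits for the whole transfer before returning.
// file_get_contents() goes through the fopen wrappers, so http:// URLs work
// when allow_url_fopen is on.
class CBlockingFetcher implements IFetcher {
    public function Fetch(string $sUrl) {
        return @file_get_contents($sUrl);
    }
}

// Stand-in fetcher for testing that never touches the network.
class CCannedFetcher implements IFetcher {
    private $aPages;
    public function __construct(array $aPages) { $this->aPages = $aPages; }
    public function Fetch(string $sUrl) {
        return $this->aPages[$sUrl] ?? false;
    }
}

// The server side depends only on the interface, so fetchers are interchangeable.
function ServeUrl(IFetcher $pFetcher, string $sUrl): string {
    $sBody = $pFetcher->Fetch($sUrl);
    return $sBody === false
        ? "HTTP/1.0 502 Bad Gateway\r\n\r\n"
        : "HTTP/1.0 200 OK\r\n\r\n" . $sBody;
}

$pFetcher = new CCannedFetcher(['http://example.com/' => '<html>hi</html>']);
echo ServeUrl($pFetcher, 'http://example.com/'), "\n";
```

A libcurl- or Snoopy-based fetcher would be one more class implementing the same interface.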
have:
As this code is written, it doesn't need forking anymore. It continually polls the sockets if any of them is ready to read or write, and then reads or writes from it. They say, that forking is very memory-expensive, etc. stuff I don't fully understand. I have little experience with such low-level programming and more or less hack my way through the manual. Maybe we should talk to someone who has a better understanding of how sockets work and are best to be used for our purpose.
(And hey, there are no dumb questions, although there are dumb answers :)
About keeping headers in string or array - when I come to code that needs them in an array, I'll think of a proper interface for it. These are very early steps and I just ignore those things that are not needed immediately. I'm making a note of it for future work though :)
Buffers: afaik in PHP there is no danger of buffer overflowing, as it resizes its strings dynamically (as opposed to a "char pBuffer[1024];" in C).
About kicking back on hostile behaviour - maybe we can, but that's a very high-level matter, which I have no experience with. Ah, no, correction, I have once tried to do a DNS zone transfer with searchlores.org, heehee :))
We are yet to devise a system for accepting or rejecting remote connections - any ideas are very welcome. Maybe you could check how it's implemented in nanoweb and/or apache?
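One possible starting point, as a minimal sketch: an allow-list with "*" wildcards per octet and deny by default. IsClientAllowed and the rule syntax are hypothetical (this is not how nanoweb or apache do it), it handles IPv4 only, and it assumes a current PHP.

```php
<?php
// Hypothetical allow-list check: rules are IPv4 addresses, "*" matches one octet.
function IsClientAllowed(string $sClientIp, array $aAllowed): bool
{
    foreach ($aAllowed as $sRule) {
        // Build an anchored regex: dots stay literal, "*" becomes one octet.
        $sRegex = '/^' . str_replace('\*', '\d{1,3}', preg_quote($sRule, '/')) . '$/';
        if (preg_match($sRegex, $sClientIp) === 1) {
            return true;
        }
    }
    return false;   // default: deny
}

$aAllowed = ['127.0.0.1', '192.168.0.*'];
var_dump(IsClientAllowed('192.168.0.17', $aAllowed));
var_dump(IsClientAllowed('10.0.0.1', $aAllowed));
```

The server would call this with the peer address right after accepting a connection and close the socket on a false result.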
About caching, and broken downloads - I hope that your experiments with nanoweb's mod_proxy will prove fruitful, and you'll write/adjust a proper caching module
Laurent:
Yes, the names were somewhat vague, that's why I called the things that go and fetch the page the client requested 'Fetchers'. The client connects to the server, which sends its fetchers to download the pages that the client requested.
The fetchers are separate from the server, so one can easily change/try differently working ones, and compare them. Maybe at the end we can settle for one fetcher, but in-between we can test various ones. Right now they are linked to the clients (each client receives one fetcher to bring documents for him, nice doggie, bring the paper :), but this may easily change should the need arise.
Fravia+:
Aye, crumbles are important, esp. if one wants someone else to follow his zigzags through the dark forest ;) After things with the framework more or less settle down, I promise to sit down and document everything, but right now it's like writing on floating sands. And what d'you know, we might even write a nice post mortem one day, instead of ... ugh ... mores mementae(? ;) ).
Too many foreign languages for today, I'm going to a friend's birthday party!
Have a nice evening, everyone!
Mordred
|
will work only for www.google.com (hardcoded ip) (n/t) (31/01/04 17:24:43)
| |
forgot
|
Question (02/02/04 20:43:52)
| |
Does anyone know, or can find, or can think of a way to find out, how big a socket is?
That is - how much memory it takes, what kind of structure it is?
Is it a mere integer-size ID, or a larger structure?
My idea: Use memory_get_usage() before and after creating, connecting, reading, writing and closing a socket and see how the memory changes.
All I get is "Call to undefined function: memory_get_usage()" :(
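For the record, on PHP builds of that era memory_get_usage() only existed when PHP was compiled with --enable-memory-limit, which would explain the error. A minimal sketch of the before/after measurement with a guard for that case follows; MemoryDelta is a hypothetical helper, the demo measures a string where the real test would create a socket, and the code assumes a current PHP.

```php
<?php
// Hypothetical helper: how many bytes does the work done by $fnWork keep allocated?
function MemoryDelta(callable $fnWork): ?int
{
    if (!function_exists('memory_get_usage')) {
        return null;    // e.g. an old build without --enable-memory-limit
    }
    $nBefore = memory_get_usage();
    $vKeep   = $fnWork();          // keep the result alive while we measure
    $nAfter  = memory_get_usage();
    unset($vKeep);
    return $nAfter - $nBefore;
}

// Demo: a 1 MB string stands in for "creating a socket".
$nBytes = MemoryDelta(function () { return str_repeat('x', 1024 * 1024); });
echo "1 MB string cost about $nBytes bytes\n";
```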
Please, someone help with this one, there are important design decisions that depend on this.
Mordred
|
Re: Question (02/02/04 21:17:28)
| |
A quick log based on your real code candidate script :
memory use at point before creating new CServer MEMORY USAGE (% KB PID ): 0.3 1616 14585
memory use at point after creating new CServer MEMORY USAGE (% KB PID ): 0.3 1640 14585
memory use at point after Server->listen MEMORY USAGE (% KB PID ): 0.3 1652 14585
memory use at point before add_client MEMORY USAGE (% KB PID ): 0.3 1652 14585
memory use at point after add_client MEMORY USAGE (% KB PID ): 0.3 1656 14585
memory use at point before pClient->Send MEMORY USAGE (% KB PID ): 0.3 1660 14585
memory use at point after pClient->Send MEMORY USAGE (% KB PID ): 0.3 1660 14585
memory use at point before pClient->Send MEMORY USAGE (% KB PID ): 0.3 1676 14585
memory use at point after pClient->Send MEMORY USAGE (% KB PID ): 0.3 1676 14585
made by calling
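function memory_use($st)
{
    global $fLog;   // log file handle opened elsewhere in the script
    $st = "memory use at point $st ";
    $my_pid = getmypid();
    // ask ps for %mem, resident size (KB) and pid, keep only the line of this process
    $st.="MEMORY USAGE (% KB PID ): ".`ps -eo%mem,rss,pid | grep $my_pid`;
    $st.= "\n";
    echo($st);
    fwrite($fLog, $st);
}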
at different locations. No more time for testing now, but you can send me a test script if you like. Just include calls to 'memory_use("whatever");' wherever you want the memory use to be traced.
ps : code based on php manual "user entry note"
Laurent
|
Re: Re: Question (03/02/04 13:15:30)
| |
Umm, this output is in KBytes, too inaccurate for these purposes :(
Still, I think I found what I needed here:
http://lxr.php.net/source/php-src/ext/sockets/php_sockets.h#73
it says:
73 #ifndef PHP_WIN32
74 typedef int PHP_SOCKET;
75 #else
76 typedef SOCKET PHP_SOCKET;
77 #endif
78
79 typedef struct {
80 PHP_SOCKET bsd_socket;
81 int type;
82 int error;
83 } php_socket;
Which means a socket is 3 ints (and all socket functions
http://lxr.php.net/source/php-src/ext/sockets/sockets.c
use local buffers, which are NOT associated with the 'socket' object)
Mordred
|
It lives :) (04/02/04 21:27:06)
| |
http://smrt.host.sk/phproxy/phproxy.0.1.3.zip
browsing is a bit slow, and there are unresolved issues with some sockets not closing properly (check the logs), but at least it works :)
at the moment there are no hooks for plugins (but you're free to experiment where is the right place for it)
should support GET-ting (google works :), but not POST-ing (posting here won't work). Timeout is set to 60 secs, you may want to increase it for heavier testing.
I'll be reading for a couple of exams, so I may be off until 13th :/ Wish me luck, and test this script a bit
Mordred
|