Retrieve Data from a Remote Webpage - PHP

Posted on 2013-2-21 14:00:00
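PHP's file functions, such as fopen(), file(), and file_get_contents(), are great for opening, reading, writing to, and doing other dirty tricks with files. Any HTML or PHP page is served as text, so we can open it by URL and extract data in many different ways (opening a URL this way requires allow_url_fopen to be enabled in php.ini). Here are just a couple.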


Retrieving meta tags from a remote webpage
PHP has a useful function called get_meta_tags that allows you to read meta data as an array and then extract certain elements.
As an example, the following code will snatch the meta data from www.waypoints.ws.
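    // get_meta_tags() returns an array keyed by lowercased meta name,
    // or false if the page cannot be opened.
    $tags = get_meta_tags('http://www.waypoints.ws/');
    $keywords = isset($tags['keywords']) ? $tags['keywords'] : '';
    $description = isset($tags['description']) ? $tags['description'] : '';
    $author = isset($tags['author']) ? $tags['author'] : '';

    echo "<b>Description:</b> $description<br />";
    echo "<b>Keywords:</b> $keywords<br />";
    echo "<b>Author:</b> $author<br />";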
This prints the description, keywords, and author of the page, one per line.
I use a variant of the get_meta_tags function to display page information as part of a preview function for the fat.ly short URL service.
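
Not every page defines all three tags, and get_meta_tags() returns false when the page cannot be opened. A minimal sketch of a defensive wrapper follows. The function name read_meta and the default value are illustrative assumptions, not part of the original article.

```php
<?php
// Hypothetical helper (not from the original article): reads selected meta
// tags from a URL or local file and fills in a default for missing ones.
function read_meta(string $source, array $names, string $default = ''): array
{
    // get_meta_tags() returns false on failure; @ suppresses its warning.
    $tags = @get_meta_tags($source);
    if ($tags === false) {
        $tags = [];
    }

    $result = [];
    foreach ($names as $name) {
        // Keys returned by get_meta_tags() are lowercased.
        $key = strtolower($name);
        $result[$name] = isset($tags[$key]) ? $tags[$key] : $default;
    }
    return $result;
}
```

Calling read_meta('http://www.waypoints.ws/', ['description', 'keywords', 'author']) would then give an array with all three keys present, whether or not the page defines them.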


Retrieving title tags from a remote webpage
The get_meta_tags function only parses data above the closing head tag of the page, so it's relatively quick. It won't read the title tag, though, so to get the title we open the entire page and extract the desired text. This process is far more time consuming.
First, we open the page for reading. I’m using fopen() but you could just as easily use file_get_contents().
    <?php
    $url = "http://www.google.com.au/";
    $page = fopen($url, 'r');
    if ($page === false) {
        die("Could not open $url");
    }
    $content = "";
    // Read the page line by line, trimming whitespace from each line.
    while (($line = fgets($page, 4096)) !== false) {
        $content .= trim($line);
    }
    fclose($page);
    ?>
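
For comparison, here is a rough sketch of the file_get_contents() route mentioned above. The helper name fetch_page and the 10-second timeout are assumptions chosen for illustration. Remote URLs only work when allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: fetches a page in one call and returns null on failure.
// The function name and the default timeout are illustrative choices.
function fetch_page(string $url, int $timeout = 10): ?string
{
    // The "http" context options apply to http:// and https:// URLs.
    $context = stream_context_create([
        'http' => ['timeout' => $timeout],
    ]);

    // @ suppresses the warning; failure is reported through the return value.
    $content = @file_get_contents($url, false, $context);
    return $content === false ? null : $content;
}
```

Unlike the fgets() loop, this keeps the page's line breaks and whitespace intact, which matters if you later match patterns that span several lines.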
Second, we extract the text between the title tags.
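    // The "i" flag makes the match case-insensitive; "s" lets "." span newlines.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/is', $content, $tmp)) {
        // Replace runs of spaces and tabs with a single space.
        $result = preg_replace('/[[:blank:]]+/', ' ', $tmp[1]);
        echo $result;
    }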
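
The title step can also be wrapped in a small function that works on any HTML string. This is only a sketch. The name extract_title is made up for illustration, and decoding entities and collapsing whitespace are choices of this sketch, not something the original code does.

```php
<?php
// Hypothetical helper: returns the text of the first <title> element,
// or null when the HTML has none.
function extract_title(string $html): ?string
{
    // [^>]* tolerates attributes on the tag, "i" ignores case,
    // and "s" lets "." match newlines inside the title.
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }

    // Turn entities such as &amp; back into plain characters.
    $title = html_entity_decode($m[1], ENT_QUOTES, 'UTF-8');

    // Collapse any run of whitespace (including newlines) to one space.
    return trim(preg_replace('/\s+/', ' ', $title));
}
```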
Retrieving any text data from a remote webpage
Using the same code as above, we can extract any text element of a page that sits between two defined and unique strings of text. This means that you can effectively snatch portions of remote web pages for inclusion in your own site. Before you do anything remotely resembling this, you should ensure you have permission to do so. Not doing so is theft.
    <?php
    $url = "http://www.your-URL-in-here.com/";
    $page = fopen($url, 'r');
    if ($page === false) {
        die("Could not open $url");
    }
    $content = "";
    while (($line = fgets($page, 4096)) !== false) {
        $content .= trim($line);
    }
    fclose($page);

    $start = "<text1>";
    $end = "</text2>";

    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    if (preg_match($pattern, $content, $match)) {
        echo $match[1];
    }
    ?>
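
When both markers are plain strings, the same extraction can be done without a regular expression at all. The sketch below swaps in strpos() and substr(). The function name extract_between is hypothetical.

```php
<?php
// Hypothetical helper: returns the text between the first occurrence of
// $start and the next occurrence of $end, or null if either is missing.
function extract_between(string $content, string $start, string $end): ?string
{
    $from = strpos($content, $start);
    if ($from === false) {
        return null;
    }
    // Move past the start marker itself.
    $from += strlen($start);

    $to = strpos($content, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($content, $from, $to - $from);
}
```

Because nothing is interpreted as a pattern, markers containing characters such as "/", "." or ">" need no escaping.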
Retrieving header information from a remote webpage
A developer can extract header information from a remote page using the following code:
    <?php
    $fp = fopen('http://www.google.com', 'r');
    if ($fp === false) {
        die('Could not open the page');
    }
    // A successful HTTP fopen() creates $http_response_header in the current scope.
    print_r($http_response_header);
    // or
    $meta_data = stream_get_meta_data($fp);
    print_r($meta_data);
    fclose($fp);
    ?>
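
$http_response_header is a flat list of raw header lines, so a little parsing makes it easier to use. The following is a sketch only. The function name parse_headers and the shape of the returned array are assumptions made for this example.

```php
<?php
// Hypothetical helper: turns raw header lines (as found in
// $http_response_header) into a status code plus a name => value map.
function parse_headers(array $lines): array
{
    $parsed = ['status' => null, 'headers' => []];
    foreach ($lines as $line) {
        // Status lines look like "HTTP/1.1 200 OK".
        if (preg_match('#^HTTP/\S+\s+(\d{3})#', $line, $m)) {
            // Each redirect adds another status line, so start over
            // and keep only the headers of the final response.
            $parsed['status'] = (int) $m[1];
            $parsed['headers'] = [];
            continue;
        }
        // Split "Name: value" at the first colon only.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $parsed['headers'][strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $parsed;
}
```

Header names are lowercased here because HTTP header names are case-insensitive, which makes lookups such as $parsed['headers']['content-type'] predictable.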
internoetics
January 8, 2010  By Marty
Posted on 2013-2-22 11:18:16
Last edited by demo on 2013-2-23 03:21

Some code you might need when getting data from a web page:
    <?php
    $url = "http://www.google.com.au/";
    $page = fopen($url, 'r');
    if ($page === false) {
        die("Could not open $url");
    }
    $content = "";
    while (($line = fgets($page, 4096)) !== false) {
        $content .= trim($line);
    }
    fclose($page);

    echo $content;
    echo "\r\n==========Remove script============\r\n";
    // Replace each script block with "|" and each style block with ":".
    $tmp = preg_replace('%<script.*?>.*?</script>%is', '|', $content);
    $tmp1 = preg_replace('%<style.*?>.*?</style>%is', ':', $tmp);
    echo $tmp1;

    echo "\r\n==========Use strip_tags to remove html tags with some exceptions============\r\n";
    // Strip every tag except <a> and <head>.
    $tmp = strip_tags($content, '<a><head>');
    echo $tmp;

    echo "\r\n==========Text only============\r\n";
    // This echoes correctly all the text that is not inside HTML tags
    $html_reg = '/<+\s*\/*\s*([A-Z][A-Z0-9]*)\b[^>]*\/*\s*>+/i';
    $tmp = preg_replace($html_reg, '', $content);
    //echo htmlentities( preg_replace( $html_reg, '', $content ) );
    echo $tmp;

    echo "\r\n==========Get text============\r\n";
    /* $tmp = preg_replace('/<script[^>]*?>[^<]*?<\/script>/i','***',$content);
     * used for bb replace .....
    $patterns = array(
        '/\[link\](.*?)\[\/link\]/',
        '/\[url\](.*?)\[\/url\]/',
        '/\[img\](.*?)\[\/img\]/',
        '/\[b\](.*?)\[\/b\]/',
        '/\[u\](.*?)\[\/u\]/',
        '/\[i\](.*?)\[\/i\]/'
    );
    $replacements = array(
        '<a href="\\1">\\1</a>',
        '<a href="\\1">\\1</a>',
        '<img src="\\1">',
        '<b>\\1</b>',
        '<u>\\1</u>',
        '<i>\\1</i>'
    );
    */
    // Single-quoted patterns avoid having to escape the inner double quotes.
    $patterns = array(
        '/<a href="(.*?)"(.*?)>(.*?)<\/a>/',
        '/<img src="(.*?)"(.*?)\/>/'
    );
    $replacements = array(
        "\r\n\\3 -->http://www.demo.com\\1\r\n",
        "\r\n Img --> http://www.demo.com\\1\r\n"
    );
    $result = preg_replace($patterns, $replacements, $tmp1);
    echo $result;
    ?>
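
The script, style, and tag removal steps can be combined into one reusable function. This is a minimal sketch, assuming you only want the visible text. The name html_to_text is made up for illustration, and the entity decoding and whitespace collapsing are additions of this sketch.

```php
<?php
// Hypothetical helper: reduces an HTML string to its visible text.
function html_to_text(string $html): string
{
    // Remove script and style elements together with their contents;
    // strip_tags() alone would leave the JavaScript and CSS text behind.
    $html = preg_replace('%<(script|style)\b[^>]*>.*?</\1>%is', ' ', $html);

    // Drop all remaining tags, keeping the text between them.
    $text = strip_tags($html);

    // Turn entities such as &amp; back into plain characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // Collapse runs of whitespace to a single space.
    return trim(preg_replace('/\s+/', ' ', $text));
}
```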
More flexible code to extract links or images:
    <?php
    $url = "http://news.msn.com/";
    $page = fopen($url, 'r');
    if ($page === false) {
        die("Could not open $url");
    }
    $content = "";
    while (($line = fgets($page, 4096)) !== false) {
        $content .= trim($line);
    }
    fclose($page);

    // Single quotes avoid having to escape the double quotes inside the markers.
    $start = '<section  id="featured_classic fill > cluster" class="featured "  data-aop="featured_classic fill > cluster">';
    $end = '<section id="clusters">';

    // preg_quote() escapes regex metacharacters and the "/" delimiter.
    $pattern = '/' . preg_quote($start, '/') . '(.*?)' . preg_quote($end, '/') . '/s';
    $mytext = preg_match($pattern, $content, $match) ? $match[1] : '';
    echo "$mytext\r\nData List \r\n";

    $patterns = array(
        '/<a href="(.*?)"(.*?)>(.*?)<\/a>/',
        '/<img(.*?)src="(.*?)"(.*?)\/>/'
    );
    $replacements = array(
        "\r\n...\\3 -->http://www.demo.com\\1\r\n",
        "\r\n...Img --> http://www.demo.com\\2\r\n"
    );
    $result = preg_replace($patterns, $replacements, $mytext);
    echo $result;
    ?>
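
If you would rather have the links as data than as rewritten text, preg_match_all() can collect them into an array. The sketch below is an assumption-laden variant of the code above. The function name extract_links is hypothetical, and unlike the original it prepends the base URL only to root-relative paths.

```php
<?php
// Hypothetical helper: collects the text and URL of every <a href="...">
// in an HTML fragment, resolving root-relative paths against $base.
function extract_links(string $html, string $base): array
{
    $links = [];
    // Capture the href value (group 1) and the link content (group 2).
    preg_match_all(
        '/<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)<\/a>/is',
        $html,
        $matches,
        PREG_SET_ORDER
    );
    foreach ($matches as $m) {
        $href = $m[1];
        // Prepend the base only to paths like "/news/a.html";
        // absolute and protocol-relative URLs are left alone.
        if (strpos($href, '/') === 0 && strpos($href, '//') !== 0) {
            $href = rtrim($base, '/') . $href;
        }
        $links[] = ['text' => trim(strip_tags($m[2])), 'url' => $href];
    }
    return $links;
}
```

For example, extract_links($mytext, 'http://www.demo.com') would return one array entry per link found in the extracted section.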
And this gives you the header information:
    <?php
    $fp = fopen('http://www.google.com', 'r');
    if ($fp === false) {
        die('Could not open the page');
    }
    // A successful HTTP fopen() creates $http_response_header in the current scope.
    print_r($http_response_header);
    // or
    $meta_data = stream_get_meta_data($fp);
    print_r($meta_data);
    fclose($fp);
    ?>
Pretty cool.
