Contents
1. Parsing library
2. Basic usage
3. Tag selectors
3.1 Selecting elements
3.2 Getting the tag name
3.3 Getting attributes
3.4 Getting content
3.5 Nested selection
3.6 Children and descendants
3.7 Parent and ancestors
3.8 Siblings
4. Standard selectors
4.1 find_all( name , attrs , recursive , text , **kwargs )
4.1.1 name
4.1.2 attrs
4.1.3 text
4.2 find( name , attrs , recursive , text , **kwargs )
4.3 find_parents() find_parent()
4.4 find_next_siblings() find_next_sibling()
4.5 find_previous_siblings() find_previous_sibling()
4.6 find_all_next() find_next()
4.7 find_all_previous() and find_previous()
5. CSS selectors
5.1 Getting attributes
5.2 Getting content
6. Summary

1. Parsing library
Beautiful Soup is a flexible, convenient web page parsing library. It is efficient and supports several parsers.
With it you can extract information from web pages without writing regular expressions.
Install: pip3 install beautifulsoup4

| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; tolerant of malformed documents | Requires a C library to be installed |
| lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires a C library to be installed |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces HTML5-format documents | Slow; requires an external Python package |

2. Basic usage
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
print(soup.title.string)

Output:
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

3. Tag selectors
3.1 Selecting elements
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)

Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><title>The Dormouse's story</title></head>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

3.2 Getting the tag name
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were
three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

Output:
title

3.3 Getting attributes
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

Output:
dromouse
dromouse

3.4 Getting content
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

Output:
The Dormouse's story

3.5 Nested selection
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

Output:
The Dormouse's story

3.6 Children and descendants
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

Output:
[' Once upon a time there were three little sisters; and their names were ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, ' ', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' and ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, ' and they lived at the bottom of a well. ']

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

Output:
<list_iterator object at 0x1064f7dd8>
0
Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4
and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6
and they lived at the bottom of a well.

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

Output:
<generator object descendants at 0x10650e678>
0
Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2

3 <span>Elsie</span>
4 Elsie
5

6

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9
and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12
and they lived at the bottom of a well.

3.7 Parent and ancestors
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

Output:
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>

html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))

Output:
[(0, <p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>), (1, <body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>), (2, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>), (3, <html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body></html>)]

3.8 Siblings
html = """
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">
<span>Elsie</span>
</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

Output:
[(0, ' '), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' and '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, ' and they lived at the bottom of a well. ')]
[(0, ' Once upon a time there were three little sisters; and their names were ')]

4. Standard selectors
4.1 find_all( name , attrs , recursive , text , **kwargs )
Searches the document by tag name, attributes, or text content.

4.1.1 name
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))

Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

4.1.2 attrs
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(id='list-1'))
print(soup.find_all(class_='element'))

Output:
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

4.1.3 text
The text argument matches on string content and returns the matching strings rather than tags. In Beautiful Soup 4.4.0 and later the same argument is also available under the name string.

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

Output:
['Foo', 'Foo']

4.2 find( name , attrs , recursive , text , **kwargs )
find() returns a single element (the first match, or None if nothing matches); find_all() returns all matching elements.

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find('ul'))
print(type(soup.find('ul')))
print(soup.find('page'))

Output:
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<class 'bs4.element.Tag'>
None

4.3 find_parents() find_parent()
find_parents() returns all ancestor nodes; find_parent() returns the direct parent node.

4.4 find_next_siblings() find_next_sibling()
find_next_siblings() returns all following siblings; find_next_sibling() returns the first following sibling.

4.5 find_previous_siblings() find_previous_sibling()
find_previous_siblings() returns all preceding siblings; find_previous_sibling() returns the first preceding sibling.

4.6 find_all_next() find_next()
find_all_next() returns all matching nodes after the current node; find_next() returns the first matching node after it.

4.7 find_all_previous() and find_previous()
find_all_previous() returns all matching nodes before the current node; find_previous() returns the first matching node before it.

5. CSS selectors
Pass a CSS selector directly to select() to make the selection.

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(type(soup.select('ul')[0]))

Output:
[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<class 'bs4.element.Tag'>

html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

5.1 Getting attributes
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

Output:
list-1
list-1
list-2
list-2

5.2 Getting content
html='''
<div class="panel">
<div class="panel-heading">
<h4>Hello</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Output:
Foo
Bar
Jay
Foo
Bar
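
Sections 4.3 to 4.7 describe the relative search methods without a worked example. The following is a minimal sketch of how they might be used, not part of the original walkthrough. It assumes a trimmed copy of the panel markup from section 4 and swaps in the built-in html.parser so that nothing beyond bs4 needs to be installed. The variable names (foo, bar, jay and the result names) are illustrative only.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the panel markup used in section 4 (assumption: one list is enough here).
html = '''
<div class="panel">
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
    </div>
</div>
'''

# Assumption: html.parser instead of lxml, so no C library is required.
soup = BeautifulSoup(html, 'html.parser')
foo, bar, jay = soup.find_all('li')

# 4.3: nearest matching ancestor, then all matching ancestors (nearest first).
parent_id = bar.find_parent('ul')['id']
ancestor_classes = [div['class'] for div in bar.find_parents('div')]

# 4.4 and 4.5: first matching sibling after and before the current tag.
next_text = bar.find_next_sibling('li').get_text()
previous_text = bar.find_previous_sibling('li').get_text()

# 4.6: matching nodes later in the document than the current tag.
after_foo = [li.get_text() for li in foo.find_all_next('li')]

# 4.7: matching nodes earlier in the document, nearest first.
before_jay = [li.get_text() for li in jay.find_all_previous('li')]

print(parent_id)         # list-1
print(ancestor_classes)  # [['panel-body'], ['panel']]
print(next_text)         # Jay
print(previous_text)     # Foo
print(after_foo)         # ['Bar', 'Jay']
print(before_jay)        # ['Bar', 'Foo']
```

Each call here passes a tag name so the whitespace strings between the li tags are skipped; the singular forms (find_parent(), find_next_sibling(), find_previous_sibling()) return one node, while the plural and find_all_ forms return lists.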
https://blog.csdn.net/qq_42554007/article/details/90675142